Taking the load off the conference chairs: towards 
a digital paper-routing assistant 
David Yarowsky and Radu Florian 
Computer Science Department and Center for Language and Speech Processing 
Johns Hopkins University 
Baltimore, Maryland 21218 
{yarowsky, rflorian}@cs.jhu.edu 
Abstract 
This paper describes and extensively evaluates a sys- 
tem for the automatic routing of submitted papers to 
reviewers and area committees, without the need for 
any human annotation from the reviewers or the pro- 
gram chair. Routing is based on a profile of previous 
writings obtainable on-line for the reviewer pool, a 
generally stable and reusable resource that requires 
no manual adaptation for new submission streams. 
The paper explores a wide set of variations and ex- 
tensions on the core model, and achieves system ac- 
curacy approaching that of several human judges on 
the same task. 
1 Introduction and Problem 
Statement 
Routing submitted papers, abstracts or grant pro- 
posals to qualified reviewers is a central task of the 
academic enterprise, and a remarkably difficult one. 
Typically it is conducted under significant time pres- 
sure in a conference reviewing cycle. As the number 
of submissions and size of the reviewer pool grows, it 
becomes increasingly difficult for a conference chair 
to be familiar with the different expertise of all mem- 
bers of the program committee. It is also difficult 
for one person to master the subtleties of fine sub- 
ject area distinctions as topic diversity in a confer- 
ence becomes large. For these reasons, conferences 
such as ACL (the Association for Computational 
Linguistics) often use a hierarchical program com- 
mittee structure, where submitted papers are first 
routed to area committees, and then more special- 
ized area chairs have the task of assigning papers to 
individual reviewers in the committee. However, in 
a diverse and multidisciplinary field such as natu- 
ral language processing, it is often difficult to define 
clear cut committee descriptions and the program 
chair still must be cognizant of the detailed exper- 
tise of the area committee members in order to route 
atypical or multidisciplinary papers to committees 
with the most appropriate pool of reviewers. The 
low inter-rater consistency results shown in Table 
12 indicate that humans find even area committee 
routing to be a difficult task. 
The following paper focuses on a range of auto- 
mated solutions to this task of routing papers to 
their most appropriate area committee. It presents 
extensive empirical investigation and evaluation of a 
wide range of issues related to this task. 
Previous published research into the problem of 
automatic routing of conference paper submissions is 
surprisingly limited. Approaches to this task can be 
essentially broken down into four major strategies: 
The first strategy is keyword based. Authors are 
required to specify a list of topic/subtopic areas for 
their papers (often from a prespecified term list), 
and reviewers then complete a survey of their rela- 
tive level of expertise on this list of topics/subtopics. 
This approach is followed by AAAI conference re- 
viewing. It suffers from the problem that authors 
often have a difficult time selecting keywords to ad- 
equately describe their work. It works best in con- 
ferences that are very broad, and is least effective in 
more focused workshops where routing distinctions 
in subject area and paradigm are more subtle. 
The second strategy is to build a statistical profile 
of reviewers' expertise by eliciting relevance judg- 
ments on a set of abstract data. AAAI also re- 
quires its reviewers to rank (bid on) submitted ab- 
stracts, and there is currently unpublished work ex- 
ploring the application of supervised routing to the 
ranked reviewer bids on AAAI submitted abstracts 
(Hirsh, personal communication). In groundbreak- 
ing work, Dumais and Nielsen (1992) developed a 
system for the routing of Hypertext'91 abstracts 
using latent semantic indexing (Deerwester et al., 
1990), trained from available text sources including 
a small set of reviewer-submitted abstracts, on-line 
books and ACM articles as a source for the term- 
by-document matrix used in their singular value de- 
compositions. Reviewers manually ranked their in- 
terest in all submitted abstracts, and best perfor- 
mance was achieved when reviewers were assigned 
twice their target number of abstracts and asked to 
choose their preferred half. 
One problem with the modeling of reviewer rank- 
ings/bids is that these may be based more on 
what the reviewer finds interesting rather than what 
Paper 185: word sense disambiguation with an ensemble of naive Bayesian classifiers

committee  Rough committee characterization                     score
com4       statistical NLP (focus on sense tagging)             0.377
com3       statistical NLP (focus on MT, statistical parsing)   0.365
com6       generation and systems                               0.258
com5       lexicons, some non-statistical sense tagging         0.242
com2       syntax/parsing (mostly non-statistical)              0.228
com1       discourse/dialog                                     0.224

Table 1: Committee routing system output
he/she is most qualified to review. Even with 
instructions, there may be a natural human ten- 
dency to bid higher on exciting/interesting abstracts 
in a more distant area and possibly bid lower on 
weak/uninteresting papers in the reviewer's core expertise.
In addition, AAAI reviewers report this to be a long and
tedious abstract ranking process that shifts the burden of
labor for paper routing onto the reviewers rather than the
program chairs.
A third option is to learn the reviewer/committee 
profiles by having the program chair assign a por- 
tion of the submissions to reviewers and/or commit- 
tees and then attempt to model these assignments in 
order to compute the assignment of remaining sub- 
missions to the reviewer pool based on these models. 
We evaluate such a strategy below. 
A fourth option, the focus of this paper, is to 
create a statistical profile of reviewers' expertise 
by modeling the collection of their previously pub- 
lished papers and other writings or statements of 
research interest extracted automatically from what 
is available on the web. One advantage of this approach
is that it frees the reviewer from a laborious
abstract ranking/bidding process. Another is that
profiles based on a large collection of the reviewer's
own writings are perhaps a better model of areas of
demonstrated expertise than simply the papers
a reviewer finds most "interesting". And, finally,
such profiles based on collected writings tend
to be relatively stable and reusable from conference
to conference (reviewers often serve on many program
committees) and may optionally be updated
when a reviewer's representative publications grow
or change significantly. The effort in creating such
publications-based profiles need not be repeated as
the pool of submitted abstracts changes. This approach
is remarkably cost effective and the empirical
results below indicate that it can achieve performance
competitive with human paper routers.
2 Task Description 
The primary task investigated in this paper is the 
routing of full-length submitted conference papers to 
Score   Reviewer    Committee
0.540   ng          com4
0.426   bruce       com4
0.420   roth        com4
0.414   golding     com4
0.369   wiebe       com1
0.368   resnik      com3
0.351   daelemans   com4
0.344   shin        com3
0.337   lee         com4
0.327   hang        com5

Table 2: Reviewer routing system output (for paper 185, above)
one of 6 area committees for ACL'99, with the committees
ranked in order of appropriateness in Table 1
(actual output of the system on sample paper #185).
The 6 committees are best defined by their members
(listed with their committee numbers in Figure 2),
but they are very roughly characterized in Table 1.
A secondary task is to provide a proposed ordered 
list of appropriate reviewers, as shown in Table 2. 
Note that this list can be filtered to include just the 
first choice committee, or can include the most ap- 
propriate reviewers independent of committee struc- 
ture. 
2.1 The Data 
The evaluation data used in these experiments con- 
sisted of full-length articles submitted to the gen- 
eral session of ACL'99. Thematic session submis- 
sions were ignored because the reviewing commit- 
tee was preselected by the author in these cases. 
The ACL'99 call for papers included a statement re- 
questing voluntary submission of electronic versions 
of their papers for a paper routing experiment. Of 
the 180 general session authors, 51% (92) partici- 
pated in the study through electronic submission. 
As noted above, electronic copies of representa- 
tive papers were also solicited from members of the 
general session area program committees. Partici- 
pants had the option of including a numeric rank- 
ing (1 to 10) indicating the representativeness of 
the papers with respect to their areas of expertise, 
but few chose to do so. In the numerous cases
where no papers, or too few, were received from a
reviewer, the self-selected sample of previous
publications was augmented by large numbers of the
reviewer's papers downloaded from cmp-lg
(xxx.lanl.gov/cmp-lg), the reviewer's own home page and
the www.cora.jprc.com archive (see footnote 1).
Papers were received and processed from 5 accept- 
able formats: latex, postscript, plain text, portable 
document format (pdf) and html, all of which were 
converted to a marked-up plain text normal for- 
mat. Distinct regions of the papers (title, abstract, 
main body, bibliography) were identified and ex- 
tracted, when possible, in support of differential re- 
gion weighting. 
2.2 Evaluation Methodology 
The primary "gold standard" for evaluation consisted 
of the committee numbers actually assigned to each 
paper by the ACL'99 program chair performing the 
committee routing. These judgments were obtained 
prior to his seeing the results of the automatic rout- 
ing experiments. Because the program chair con- 
sidered other factors including potential conflicts of 
interest in assigning papers, this is not a perfect an- 
notation of the most appropriate committee based 
strictly on mass of reviewer expertise. Three other 
judges (2 NLP faculty members and one 3rd year 
NLP grad student) also routed those papers volun- 
tarily submitted from the authors for the routing 
experiments, with their names, addresses and insti- 
tutions stripped. Greatest committee appropriate- 
ness based on topic and reviewer expertise was the 
sole criterion for these paper assignments. A second
evaluation gold standard was obtained from the
weighted consensus of the 4 judges (described in
Section 7 below).
The 92 submitted papers were divided into two 
equal halves: a primary test set on which all ma- 
jor results were evaluated, and a secondary devtest 
set, via which some global parameters were esti- 
mated and the one instance of supervised training 
took place. 
Several evaluation measures were used to reflect 
system performance. The first is exact match clas- 
sification accuracy (the percentage of the papers on 
which the gold standard and system agreed exactly 
on the committee assignment). Because the system
returns a full preferred rank order of the 6 committees
for all papers, a second natural performance
measure is the average position of the truth (gold
standard committee selection) in this rank list. This
measure gives an assessment of how many committees
a human judge would have to consider, on
average, before finding the correct classification;
smaller is better. Because in many cases there are
two equally viable committee contenders, a third
measure, One-of-best-2, indicates the percentage of
cases where the gold standard classification is among
the top two choices ranked by the system. In many
cases, the whole histogram is given, indicating the
position of the gold standard classification in the
system's committee ranking.
1 This is a web search engine specialized in searching
Computer Science related papers (see (McCallum et al., 1999)).
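As an illustration, the evaluation measures described above can be computed from ranked system output roughly as follows. This is a minimal sketch rather than the code used in the experiments; the function name `evaluate`, the paper identifiers and the toy rankings are hypothetical.

```python
from collections import Counter

def evaluate(rankings, gold):
    """rankings: paper id -> list of committees, best first.
    gold: paper id -> gold-standard committee."""
    # 1-based position of the gold-standard committee in each ranking
    positions = [rankings[p].index(gold[p]) + 1 for p in gold]
    n = len(positions)
    return {
        "accuracy": sum(pos == 1 for pos in positions) / n,
        "average_position": sum(positions) / n,
        "one_of_best_2": sum(pos <= 2 for pos in positions) / n,
        "histogram": Counter(positions),
    }

# Hypothetical toy data (not taken from the paper)
rankings = {
    "p1": ["com4", "com3", "com6", "com5", "com2", "com1"],
    "p2": ["com2", "com3", "com4", "com1", "com5", "com6"],
    "p3": ["com1", "com6", "com2", "com3", "com4", "com5"],
    "p4": ["com3", "com4", "com2", "com5", "com1", "com6"],
}
gold = {"p1": "com4", "p2": "com3", "p3": "com2", "p4": "com4"}
scores = evaluate(rankings, gold)
```

On this toy input the gold committee sits at positions 1, 2, 3 and 2, so one of four papers is an exact match and three of four fall within the top two.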
3 Routing Methodologies 
There are numerous methods described in the infor- 
mation retrieval literature for article routing. As- 
suming that there are n classes and a set of m arti- 
cles, the article routing task attributes each of the 
m articles to one of the n classes. It is clear that 
our task fits well in this paradigm; each paper has 
to go to one committee. The two major approaches 
tested in this model are the standard Salton-style 
vector space model (Salton and McGill, 1983) and 
the Naive Bayes classifier (Mosteller and Wallace, 
1964). 2 These and several permutations and exten- 
sions are detailed and evaluated below. 
3.1 Vector Routing Model 
Unless we specify otherwise, we shall assume that
the vocabulary is selected by removing a set of common
(stop) words from the text. Both the submitted
papers and the reviewer papers are represented
in the space $[0, \infty)^{|V|}$, as vectors $D_i$: $D_{ij} = c_{ij} \cdot w_j$,
where $c_{ij}$ is the count (the number of occurrences)
of the jth word in document $D_i$ and $w_j$ is an
"importance" weight associated with the jth word. One
typical weighting function is IDF (Inverse Document
Frequency), in its standard form $w_j = \log(N / docf_j)$,
where N is the total number of documents and $docf_j$ is
the document frequency of the jth word (the number of
documents the word appears in). One can measure the
similarity between 2 documents by using the cosine
similarity between their vector representations:

$$\mathrm{cosine\_similarity}(D_i, D_j) = \frac{\sum_k D_{ik} D_{jk}}{\|D_i\|_2 \, \|D_j\|_2}$$

that is, the dot product of the normalized vectors (see footnote 3
and (Salton and McGill, 1983)). This measure of similarity
yields values close to 1 for similar vectors and
close to 0 for dissimilar ones.
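The representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny stop list, the whitespace tokenizer, the plain `log(N / docf)` form of IDF and the three example texts are all hypothetical choices, not details taken from the experiments.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "a", "an", "with", "for", "and"}  # assumed stop list

def tokenize(text):
    """Lowercase, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def idf_weights(docs):
    """w_j = log(N / docf_j) over a list of tokenized documents."""
    n = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n / df) for w, df in doc_freq.items()}

def to_vector(doc, weights):
    """Sparse vector with entries c_ij * w_j."""
    return {w: c * weights.get(w, 0.0) for w, c in Counter(doc).items()}

def cosine_similarity(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [tokenize(t) for t in (
    "word sense disambiguation with naive Bayesian classifiers",
    "statistical parsing with a lexicalized grammar",
    "sense tagging with an ensemble of classifiers",
)]
weights = idf_weights(docs)
vectors = [to_vector(d, weights) for d in docs]
sim_01 = cosine_similarity(vectors[0], vectors[1])  # no shared content words
sim_02 = cosine_similarity(vectors[0], vectors[2])  # share "sense", "classifiers"
```

Sparse dictionaries are used instead of dense arrays because each document touches only a small part of the vocabulary.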
2 Routing using these and other models is a central
task in information retrieval, discussed in depth in (Hull,
1994), (Lewis and Gale, 1994), (Larkey and Croft, 1996) and
(Voorhees and Harman, 1998) and many other articles.
3 $\|\cdot\|_2$ being the Euclidean norm.
                   Baseline   Keywords Only   Keywords, Title, Abstract Only
Accuracy:          19.6%      36.9%           52.2%
Average position:  3.02       2.48            2.28
One-of-best-2:     50.0%      60.9%           65.2%
Histogram:         [histogram plots]

Table 3: Baseline performance measures
The main algorithm proceeds as follows: 
1. For the ith reviewer (i = 1, ...), compute a centroid
   $R_i$, a vector presumably associated with
   the main research interests of the reviewer:

   $$R_{ij} = \sum_{P \in \mathcal{P}_i} r(P) \cdot c_j(P) \cdot w_j \qquad (1)$$

   where $\mathcal{P}_i$ is the pool of papers for the ith reviewer,
   $r(P)$ is the weight/relevance of paper P and
   $c_j(P)$ is the word count of the jth word in paper P
   (a given word might be weighted differently in different
   regions; see region weighting below).
2. For each committee, compute its centroid as the
   sum of the composing reviewers' centroids:

   $$C_k = \sum_{R_i \in \mathcal{C}_k} R_i \qquad (2)$$

   where $\mathcal{C}_k$ is the pool of reviewers for committee k.
3. For each paper, rank all the committees based
   on the cosine similarity between the paper's vector
   and the committee centroids; the one that
   ranks highest is chosen as the classification of
   the paper:

   $$\mathrm{classification}(P_l) = \arg\max_{k = 1 \ldots 6} \, \mathrm{cosine\_similarity}(P_l, C_k) \qquad (3)$$
Table 3 gives results for several basic baseline mod- 
els. Section 2.2 describes these measures. 
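The three steps of the main algorithm can be sketched as follows. This is a minimal illustration, assuming that each paper has already been turned into a sparse dictionary of weighted term counts; the reviewer names, committee memberships and term values are hypothetical.

```python
import math
from collections import defaultdict

def add_scaled(target, vec, scale=1.0):
    for w, x in vec.items():
        target[w] += scale * x

def reviewer_centroid(papers):
    """Step 1: papers is a list of (relevance r(P), weighted term vector)."""
    centroid = defaultdict(float)
    for relevance, vec in papers:
        add_scaled(centroid, vec, relevance)
    return centroid

def committee_centroid(reviewer_centroids):
    """Step 2: sum of the member reviewers' centroids."""
    centroid = defaultdict(float)
    for rc in reviewer_centroids:
        add_scaled(centroid, rc)
    return centroid

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_committees(paper_vec, committees):
    """Step 3: committees ordered by cosine similarity, best first."""
    return sorted(committees,
                  key=lambda k: cosine(paper_vec, committees[k]),
                  reverse=True)

# Hypothetical toy data
reviewers = {
    "rev_a": [(1.0, {"sense": 2.0, "tagging": 1.0})],
    "rev_b": [(1.0, {"sense": 1.0, "classifier": 2.0})],
    "rev_c": [(1.0, {"discourse": 2.0, "dialog": 1.0})],
}
membership = {"com4": ["rev_a", "rev_b"], "com1": ["rev_c"]}
committees = {
    k: committee_centroid([reviewer_centroid(reviewers[r]) for r in members])
    for k, members in membership.items()
}
paper = {"sense": 1.0, "classifier": 1.0, "dialog": 0.5}
ranking = rank_committees(paper, committees)
```

Returning the whole ranking rather than only the top committee makes it possible to compute the average position and One-of-best-2 measures from the same output.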
Clearly different regions of a paper have different 
importance in determining its semantic context. We 
automatically separate the text into title, abstract, 
keywords, body and bibliography regions and inves- 
tigate different weighting parameters for these re- 
gions. 
The results for full text and region weighting are 
given in Table 5. Consensus evaluation is described 
Router   com1   com2   com3   com4   com5   com6
com1     1      0      0      0      0      2
com2     1      9      1      0      0      3
com3     0      0      6      2      0      1
com4     0      0      4      3      0      2
com5     0      0      1      0      5      0
com6     0      1      1      0      1      2

Table 4: Confusion matrix for the full text, region weighting case
in Section 7. A confusion matrix (see footnote 4) showing region
weighting results is given in Table 4. Note that the
primary confusion is between the difficult-to-distinguish
committees 3 and 4.
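As a worked check, the summary figures can be recomputed from the cells of Table 4. The sketch below assumes the matrix values as transcribed from that table, with rows as the committee chosen by the router and columns as the true committee; the variable names are illustrative only.

```python
committees = ["com1", "com2", "com3", "com4", "com5", "com6"]

# Rows: committee chosen by the router; columns: true committee (Table 4).
confusion = [
    [1, 0, 0, 0, 0, 2],
    [1, 9, 1, 0, 0, 3],
    [0, 0, 6, 2, 0, 1],
    [0, 0, 4, 3, 0, 2],
    [0, 0, 1, 0, 5, 0],
    [0, 1, 1, 0, 1, 2],
]

n = len(committees)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n))
accuracy = correct / total

# Errors per unordered committee pair, counting both directions.
pair_errors = {
    (committees[i], committees[j]): confusion[i][j] + confusion[j][i]
    for i in range(n) for j in range(i + 1, n)
}
most_confused = max(pair_errors, key=pair_errors.get)
```

The diagonal holds 26 of the 46 test papers, which corresponds to the 56.5% exact match accuracy reported for region weighting, and the pair with the most errors is committees 3 and 4.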
The remainder of this section describes the modifi- 
cations made to this model, the results we obtained, 
conclusions and explanations of the results. 
3.1.1 Weighting Paper Sources Differently 
As noted before, the reviewers' papers were obtained 
from different sources, with potentially different rel- 
ative indicativeness of a reviewer's expertise. A vari- 
ety of relative weighting parameters for these sources 
were explored on the devtest. None yielded a signif- 
icant improvement over the equally weighted model. 
3.1.2 Term Selection and Weighting 
Experiments were conducted to test the efficacy of 
two variants of IDF (based on the concepts of 1 document
per reviewer and 1 document per committee),
entropy-based term weighting, use of stemming, and 
4 A cell (i,j) in the confusion matrix shows how many times
committee i was chosen where committee j was the true
assignment. It is an indication of the nature of the
misclassification observed, not merely its absolute number.
                   Full Text,        Full Text,         Full Text (RW),
                   Equal Weighting   Region Weighting   Consensus Evaluation
Accuracy           47.8%             56.5%              67.4%
Average position   2.09              1.96               1.72
One-of-best-2      69.6%             71.7%              78.3%
Histogram:         [histogram plots]

Table 5: Performance on Full Text Routing
[Table body: per-region word weights for the Title, Abstract, Body, Topic Area and Bibliography regions; the values are illegible in the source]
Table 6: Word Weights Based on Region
vocabulary selection based on statistically significant 
cross-class frequency variation. No variation outper- 
formed the region weighting model shown in Table 
5. 
3.2 Naive Bayes Classifier 
The naive Bayes model makes an independence assumption
relative to the words in a text. It chooses
the committee $C_j$ that maximizes the probability
$P(C_j \mid P_i)$; formally

$$\arg\max_j P(C_j \mid P_i) = \arg\max_j \frac{P(C_j) \cdot P(P_i \mid C_j)}{P(P_i)}$$
$$= \arg\max_j P(C_j) \prod_{w_k \in P_i} P(w_k \mid C_j)$$
$$= \arg\max_j \Big( \log P(C_j) + \sum_{w_k \in P_i} \log P(w_k \mid C_j) \Big)$$

and, furthermore, if one assumes equal a priori probability on the committees
($P(C_j) = ct$), then one looks for

$$\arg\max_j \sum_{w_k \in P_i} \log P(w_k \mid C_j)$$

where the words $w_k$ are the target words in the
article $P_i$ (usually all the non-stopwords).
One of the issues that need to be addressed when 
considering naive Bayes approaches is smoothing. 
One cannot afford to have null probabilities, as 
they would just nullify the results. The smoothing 
method used in this approach is the simple additive 
smoothing method, that adjusts the maximum like- 
lihood estimates as follows: 
$$P(w_k \mid C_j) = \frac{\delta + c(w_k, C_j)}{\delta \cdot |V| + N(C_j)}$$
Accuracy:          52.2%
Average position:  2.20
One-of-best-2:     67.4%
Histogram:         [histogram plot]

Table 7: Results for the Naive Bayes Classifier
where $N(C_j) = \sum_k c(w_k, C_j)$ and V is the whole
vocabulary. This is a very simple strategy, but we
believe that it works relatively well for unigrams.
Results are shown in Table 7; it underperforms the
region weighted vector-based model with similar
parameters.
To check whether unseen words are a problem in
our case, we varied the parameter $\delta$. Since the results
were almost the same for $\delta$ values varying from 0.01
to 1, we conclude that more sophisticated smoothing
methods (e.g. Good-Turing, Kneser-Ney) would
not have made a difference either.
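A compact sketch of the naive Bayes ranking with additive smoothing and uniform committee priors is given below. It is an illustration only: the training documents, the function names and the treatment of words outside the training vocabulary (they simply receive the smoothed zero-count estimate) are assumptions.

```python
import math
from collections import Counter

def train(committee_docs):
    """committee_docs: committee -> list of tokenized documents.
    Returns per-committee word counts and the joint vocabulary."""
    counts = {c: Counter(w for doc in docs for w in doc)
              for c, docs in committee_docs.items()}
    vocab = set().union(*counts.values())
    return counts, vocab

def log_likelihood(paper, word_counts, vocab_size, delta):
    """Sum of log P(w_k | C_j) with additive smoothing."""
    total = sum(word_counts.values())  # N(C_j)
    return sum(
        math.log((delta + word_counts[w]) / (delta * vocab_size + total))
        for w in paper
    )

def rank_committees(paper, counts, vocab, delta=0.1):
    scores = {c: log_likelihood(paper, wc, len(vocab), delta)
              for c, wc in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy training data
committee_docs = {
    "com4": [["sense", "tagging", "classifier"],
             ["sense", "bayes", "classifier"]],
    "com1": [["discourse", "dialog", "anaphora"]],
}
counts, vocab = train(committee_docs)
paper = ["sense", "classifier", "ensemble"]
rankings_by_delta = {d: rank_committees(paper, counts, vocab, delta=d)
                     for d in (0.01, 0.1, 1.0)}
```

In this toy case the top committee is the same for every smoothing value tried, which mirrors the kind of insensitivity to the smoothing parameter reported above.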
3.3 Voting 
As an alternative approach to the top-down hierarchical
routing strategy, we investigated the initial
direct assignment of papers to reviewers, and then
allowed each of the top k reviewers to vote for his or her own
committee. Although optimal performance here was
slightly lower than for the reference system (46.5%,
2.22), the gold standard is based on the primacy of
human committee assignments, which carry no guarantee
that the committee has an adequate number
of well qualified reviewers. Without the ability
                   Sim_text only   Sim_text and Sim_bib   Sim_bib only   Consensus evaluation
                   (β = 0)         (β = 0.76)             (β = 1)        (β = 0.76, consensus)
Accuracy:          56.5%           63.0%                  39.1%          69.0%
Average position:  1.96            1.85                   2.35           1.65
One-of-best-2:     71.7%           73.9%                  56.5%          78.3%
Histogram:         [histogram plots]

Table 8: Performance of routing based on bibliographic similarity
for cross-committee reviewing, a committee with 3 
moderately-well qualified reviewers would probably 
be preferable to a committee with only a single qual- 
ified reviewer but with extremely strong expertise. 
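The voting scheme can be illustrated with the reviewer ranking for paper 185 shown in Table 2. The sketch assumes one unweighted vote per reviewer, which is one simple reading of the description; the function name `vote` is hypothetical.

```python
from collections import Counter

# Top-ranked reviewers for paper 185 with their committees (from Table 2).
ranked_reviewers = [
    ("ng", "com4"), ("bruce", "com4"), ("roth", "com4"), ("golding", "com4"),
    ("wiebe", "com1"), ("resnik", "com3"), ("daelemans", "com4"),
    ("shin", "com3"), ("lee", "com4"), ("hang", "com5"),
]

def vote(ranked, k):
    """Each of the top k reviewers casts one vote for his or her committee;
    committees are returned from most to fewest votes."""
    votes = Counter(committee for _, committee in ranked[:k])
    return [committee for committee, _ in votes.most_common()]

top5 = vote(ranked_reviewers, 5)
top10 = vote(ranked_reviewers, 10)
```

With k = 5, four of the five votes go to com4; with k = 10, com4 still leads with six votes, followed by com3 with two.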
3.4 Routing based on (transitive) 
bibliographic similarity 
Appropriate reviewers for a paper can often be deter- 
mined through analysis of the paper's bibliography. 
Clearly direct citation of a potential reviewer is par- 
tial evidence of that person's suitability to review 
the paper. This relation is also somewhat transi- 
tive, as the authors who cite or are cited by an au- 
thor directly cited in the paper also have increased 
likelihood of being relevant reviewers. 
The goal of this section is to identify transitively
related authors via chains of the bibliographic
relations $\mathrm{Cites}(author_i, author_j)$ and
$\mathrm{Coauthor}(author_i, author_j)$. To estimate these
relations, we automatically extracted and normalized
bibliographic citations from a large body of on-line 
texts including all of the reviewer-submitted papers. 
Via transitive use of this extensive citation data, 
reviewer-paper similarity could be estimated even 
when there was no direct mention of the reviewer 
in the text to be routed. 
To formalize this approach, let us assume
that there exists an indexed set of authors
$\mathcal{A} = \{a_1, \ldots, a_{n_a}\}$. The reviewers are part of this set;
let $\mathcal{R} = \{r_1, \ldots, r_{n_r}\}$ denote the set of reviewers. We
also have at our disposal a set of papers submitted by
reviewers, $\mathcal{P} = \{p_1, \ldots, p_{n_p}\}$. Using the set $\mathcal{P}$ we compute
2 matrices, Cites and Coauthor:

$$\mathrm{Cites}_{ij} = \frac{\sum_{p \in \mathcal{P}} N_p(a_i, a_j)}{\sum_{k=1}^{n_a} \sum_{p \in \mathcal{P}} N_p(a_i, a_k)} \qquad \mathrm{Coauthor}_{ij} = \frac{N_c(a_i, a_j)}{\sum_{k=1}^{n_a} N_c(a_i, a_k)}$$

where $N_p(a_i, a_j)$ is the number of times $a_j$ was cited
in the paper p if $a_i$ is an author of p, 0 otherwise, and
$N_c(a_i, a_j)$ is the number of papers in which $a_i$ and
$a_j$ were coauthors identified either from the head of
Distance d         1       2       3       4
Accuracy           30.3%   30.4%   32.6%   34.5%
Average Position   2.28    2.22    2.17    2.17
One-of-best-2      65.2%   71.7%   73.9%   71.7%

Table 9: Performance comparison at different levels
of parameter d, with λ = 0.8 and β = 1, evaluated on
devtest data
a paper $p \in \mathcal{P}$, or a bibliographic citation extracted
from p. The relation Cited_by can be captured by
the transposition of the citation matrix, $\mathrm{Cites}^T$.
A symmetric similarity matrix combining these
base relations is defined as:

$$\mathrm{Sim}^1 = \lambda \cdot \tfrac{1}{2} (\mathrm{Cites} + \mathrm{Cites}^T) + (1 - \lambda) \cdot \tfrac{1}{2} (\mathrm{Coauthor} + \mathrm{Coauthor}^T) \qquad (4)$$

where $\lambda$ is a weighting factor between the contributing
sources of similarity. The index 1 ($\mathrm{Sim}^1$) denotes
"direct" (non-transitive) bibliographic similarity. We
enforced that $\mathrm{Sim}^1_{ii} = 1$ for all authors i.
The submitted articles, $P_1, \ldots, P_{n_p}$, were routed
to committees based on similarities between the authors
cited in the paper and the reviewers forming a
committee:

$$\mathrm{Sim}(P_l, C_k) = \frac{1}{|C_k|} \sum_{r_i \in C_k} \mathrm{Sim}(P_l, r_i) \qquad (5)$$

where

$$\mathrm{Sim}(P_l, r_i) = \sum_{j=1}^{n_a} C(P_l, a_j) \cdot \mathrm{Sim}^1(a_j, r_i)$$

$C(P_l, a_j)$ being the number of times author $a_j$ was
cited in paper $P_l$. A paper is routed to the committee
that maximizes the paper/committee similarity
given in (5). Tuning the parameter $\lambda$ on the training
set yielded $\lambda = 0.8$.
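The direct bibliographic similarity and the paper/committee score can be sketched as below. Several points are assumptions made for the illustration: the Cites and Coauthor matrices are taken as given (already normalized), the first term of the combination is weighted by the factor set to 0.8 in the text, the committee score is an average over its members, and the four-author example data are invented.

```python
def direct_similarity(cites, coauthor, lam=0.8):
    """Symmetric combination of the Cites and Coauthor matrices,
    with the diagonal forced to 1."""
    n = len(cites)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sym_cites = 0.5 * (cites[i][j] + cites[j][i])
            sym_coauthor = 0.5 * (coauthor[i][j] + coauthor[j][i])
            sim[i][j] = lam * sym_cites + (1 - lam) * sym_coauthor
        sim[i][i] = 1.0
    return sim

def paper_reviewer_sim(citation_counts, reviewer, sim):
    """citation_counts: author index -> times cited in the paper."""
    return sum(count * sim[author][reviewer]
               for author, count in citation_counts.items())

def paper_committee_sim(citation_counts, members, sim):
    """Average paper/reviewer similarity over the committee members."""
    return sum(paper_reviewer_sim(citation_counts, r, sim)
               for r in members) / len(members)

# Hypothetical data: authors 0..3, where 2 and 3 are reviewers.
cites = [[0, 0.5, 0.5, 0], [0, 0, 1.0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
coauthor = [[0, 0, 0, 0], [0, 0, 0, 1.0], [0, 0, 0, 0], [0, 1.0, 0, 0]]
sim1 = direct_similarity(cites, coauthor)

committees = {"comA": [2], "comB": [3]}
paper_citations = {0: 2, 1: 1}  # cites author 0 twice and author 1 once
scores = {k: paper_committee_sim(paper_citations, members, sim1)
          for k, members in committees.items()}
best = max(scores, key=scores.get)
```

Here the paper never cites either reviewer directly, yet reviewer 2 scores higher because both cited authors cite that reviewer.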
The similarity relation computed in formula (4) is 
very sparse, as a large number of values are 0. To 
compute a more robust similarity, one can consider 
the transitive closure of the graph defined by $\mathrm{Sim}^1$.
The weights in the resulting graph are:

$$\mathrm{Sim}^\infty(i, j) = \sum_{\substack{i = i_1, \ldots, i_n = j \\ i_l \neq i_v}} C(i_1 \ldots i_n) \qquad (6)$$

where $\mathrm{Sim}^\infty(i, j)$ is the similarity between the ith
and jth author. The similarity along one path could
be any function of the weights of the composing
links. The one we considered is:

$$C(i_1 \ldots i_n) = \prod_{k=1}^{n-1} \mathrm{Sim}^1(i_k, i_{k+1})$$

Computing the values in (6) proves to be computationally
expensive, and it appears that extending
the transitive similarity relationship indefinitely
may become counterproductive. Therefore, we limited
the length of the paths involved in computing
the formula (6):

$$\mathrm{Sim}^d(i, j) = \sum_{n=1}^{d} \; \sum_{\substack{i = i_1, \ldots, i_n = j \\ i_l \neq i_v}} C(i_1 \ldots i_n)$$

Let us observe that $\mathrm{Sim}^\infty(i, j) = \lim_{d \to \infty} \mathrm{Sim}^d(i, j)$,
hence the name. In Table 9, one can observe that
the routing performance increases as d increases up
through a transitive distance of 3, with mixed results
beyond that point.
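The path-limited similarity can be computed by a depth-first enumeration of simple paths, as in the sketch below. It assumes that the distance bound counts the number of links on a path (so a bound of 1 gives the direct similarity), which is one reading of the definition; the function name and the four-author matrix are hypothetical.

```python
def path_limited_similarity(sim, i, j, max_links):
    """Sum, over all simple paths from i to j with at most max_links links,
    of the product of the direct similarities along the path."""
    def walk(node, visited, weight, links_left):
        total = 0.0
        for nxt, w in enumerate(sim[node]):
            if nxt in visited or w == 0.0:
                continue  # no repeated authors, skip missing links
            if nxt == j:
                total += weight * w  # path ends at the target
            elif links_left > 1:
                total += walk(nxt, visited | {nxt}, weight * w, links_left - 1)
        return total

    if max_links < 1:
        return 0.0
    return walk(i, {i}, 1.0, max_links)

# Hypothetical direct similarities forming a chain 0 - 1 - 2 - 3
sim1 = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.4, 0.0],
    [0.0, 0.4, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
direct_02 = path_limited_similarity(sim1, 0, 2, 1)
two_step_02 = path_limited_similarity(sim1, 0, 2, 2)
three_step_03 = path_limited_similarity(sim1, 0, 3, 3)
```

Authors 0 and 2 have no direct link, but become related through author 1 once two links are allowed. The enumeration grows quickly with the bound, which is consistent with the computational cost noted above.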
Section 3.4 has, until now, described a routing
similarity based only on transitive bibliographic
citation and co-authorship ($\mathrm{Sim}_{bib}$). However, routing
a paper solely on this basis is not optimal as it
ignores similarity between the terms in the full
text ($\mathrm{Sim}_{text}$), as described in Section 3.1 using
region weighting. We combined these two measures
through interpolation:

$$\mathrm{Sim}(P_l, r_i) = \beta \cdot \mathrm{Sim}_{bib}(P_l, r_i) + (1 - \beta) \cdot \mathrm{Sim}_{text}(P_l, r_i) \qquad (7)$$

On the training set, a value of $\beta = 0.76$ was found to
maximize performance, for d = 3 and the previously
fixed $\lambda = 0.8$.
The full evaluation of the transitive bibliographic
similarity measure is given in Table 8. Performance
using exclusively $\mathrm{Sim}_{bib}$ ($\beta = 1$) is considerably
lower (39.1%) than the previous best text-based
similarity ($\mathrm{Sim}_{text}$) performance of 56.5% exact
match accuracy. However, combining the two
evidence sources yields a substantially higher routing
accuracy of 63.0%. This result is also observed
when evaluating on the consensus gold standard
described in Section 7, where the combined model
accuracy of 69.0% exceeds the $\mathrm{Sim}_{text}$-only accuracy of
67.4%. As shall be shown, for both evaluation standards
the combined system accuracy rivals that of
several human judges.
Accuracy:          52.2%
Average position:  2.00
One-of-best-2:     69.6%
Histogram:         [histogram plot]

Table 10: Author-based paper routing
3.4.1 Routing based exclusively on the 
paper's author 
Until now, we have ignored a submitted paper's
author(s) when making the routing decision. However,
ACL'99 reviewing was not blind, and an interesting
question is what routing performance is achieved when
classification is based exclusively on the authors'
identity. Using only $\mathrm{Sim}_{bib}(author, reviewer_j)$ for
the paper's author(s), exact match accuracy completely
ignoring the submitted paper (52.5%) approaches
the accuracy obtained using only the submitted
text (56.5%), as shown in Table 10. This suggests
that an author's identity alone is largely sufficient
for routing the paper to the committee most
appropriate for evaluating her or his work.
4 Supervised Learning 
The algorithms presented so far are unsupervised; 
the only use for labeled data in the devtest was for 
global parameter optimization. This is a strength of 
the approach presented here, because it can be used 
successfully without any human annotation. In this 
section, we tested the efficacy of training supervised 
models based on initial program chair annotation of 
a portion of the submitted papers. Models of the 
types of papers initially assigned to each committee 
can help select further papers appropriate for that 
committee. Using the vector model, we can define
the centroid $D_{ij}$ of papers initially routed to a given
committee as in (2), where $D_{ij} = \sum_{P_k \in \mathcal{C}_i} c(w_j, P_k)$
and $c(w_j, P_k)$ is the count associated with paper $P_k$
and the jth word. Rather than use these models
in isolation, we combine them with the previously
described reviewer centroids for each committee, $C_{ij}$,
into $C'_{ij} = C_{ij} + \lambda \cdot D_{ij}$, where the parameter $\lambda$
was optimized on the devtest to be 3. The results
are presented in Table 11, and outperform the simple
unsupervised model 60.9% to 56.5%, given initial
program chair annotation of 1/2 of the data (the
devtest set).
The updates to the base centroids were made off- 
line in our method; however, this is not required; 
Accuracy:          60.9%
Average position:  1.98
One-of-best-2:     76.1%
Histogram:         [histogram plot]

Table 11: Adaptation to the primary judge's partial
annotation of the data
once the decision is made (a new paper is routed), 
the "true" label can be used to update the corre- 
sponding centroid. There are numerous methods 
that could be borrowed from AI and IR to imple- 
ment this strategy, including Active Learning (Lewis 
and Gale, 1994). Such online adaptation can maxi- 
mally leverage program chair feedback and minimize 
the need for initial tagged training data. 
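The centroid adaptation, in both its off-line and its online form, amounts to adding weighted paper vectors to a committee's reviewer-based centroid. The following sketch shows both variants under the assumption that centroids and papers are sparse dictionaries of term counts; the weight of 3 follows the devtest value quoted above, while the function names and toy vectors are hypothetical.

```python
from collections import defaultdict

def adapted_centroid(reviewer_centroid, routed_paper_vectors, lam=3.0):
    """Off-line variant: reviewer-based centroid plus lam times the sum
    of the term vectors of papers already routed to the committee."""
    adapted = defaultdict(float, reviewer_centroid)
    for vec in routed_paper_vectors:
        for w, c in vec.items():
            adapted[w] += lam * c
    return adapted

def update_online(centroid, paper_vector, lam=3.0):
    """Online variant: fold in one newly labeled paper, in place."""
    for w, c in paper_vector.items():
        centroid[w] = centroid.get(w, 0.0) + lam * c
    return centroid

# Hypothetical toy data
base = {"sense": 2.0, "tagging": 1.0}
routed = [{"sense": 1.0, "ensemble": 1.0}]
centroid = adapted_centroid(base, routed)
update_online(centroid, {"bayes": 2.0})
```

The off-line function leaves the original reviewer centroid untouched, so the unsupervised model remains available for comparison.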
5 Automatic Area Committee 
Generation 
In a hierarchical routing system, clearly the composi- 
tion of the committees is crucial. Suboptimal results 
are achieved if the 3 most appropriate reviewers for 
a paper are spread out over different committees. 
As an experiment to see if the committee organi- 
zation could possibly be improved, we investigated 
empirically committee structures using several clus- 
tering strategies. 
In the first test, we generated a hierarchical ag- 
glomerative cluster of the entire reviewer set based 
on the pairwise cosine similarity between their pub- 
lication vectors, using maximal linkage clustering 
(Duda and Hart, 1973; Jain and Dubes, 1988). The 
results are given in Figure 2a, showing the full tree 
and extracted cluster list. The numbers in brackets 
indicate the actual committee assignment of the re- 
viewers; basic inspection will indicate that the de- 
rived clusters correspond closely to existing com- 
mittee compositions (although this information was 
completely ignored in the clustering process). Anal- 
ysis of the substructure in the tree shows a natural 
sub-clustering by research subfocus (e.g. ((isabelle 
(knight (fung wu))) somers)). Inspection will also 
show that people with close research focus are spread 
out among 3 or more different committees, raising 
some doubts about the optimality of any committee-based
routing process.
In another experiment, we tested the extent to
which committees could more productively be reformed
by beginning with the initial committee centroids
and redistributing the reviewers using K-means
clustering. We used a modified version of K-means
to obtain reviewer groups that are balanced in
size, similar to the original committees. This was
done by limiting the class size to the maximum
number of reviewers in an original class; the starting
point of the algorithm was based on the original
committees. The resulting clusters are shown in
Figure 2b. The basic initial committee composition is
preserved, with some outliers reassigned.
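One way to realize a size-limited K-means of this kind is sketched below. The greedy capacity rule (reviewer/committee pairs are considered from most to least similar, and a committee stops accepting reviewers once full) is an assumption made for the illustration, since the exact modification is not spelled out; all names and data are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sum_vectors(vectors):
    total = {}
    for vec in vectors:
        for w, x in vec.items():
            total[w] = total.get(w, 0.0) + x
    return total

def balanced_kmeans(vectors, initial, capacity, max_iter=20):
    """vectors: reviewer -> term vector; initial: reviewer -> committee.
    Assumes capacity * number of committees >= number of reviewers."""
    assignment = dict(initial)
    committees = sorted(set(assignment.values()))
    for _ in range(max_iter):
        centroids = {k: sum_vectors(vectors[r] for r in assignment
                                    if assignment[r] == k)
                     for k in committees}
        # Most similar reviewer/committee pairs are served first.
        pairs = sorted(((cosine(vectors[r], centroids[k]), r, k)
                        for r in vectors for k in committees),
                       key=lambda t: t[0], reverse=True)
        new_assignment, load = {}, {k: 0 for k in committees}
        for _, r, k in pairs:
            if r not in new_assignment and load[k] < capacity:
                new_assignment[r] = k
                load[k] += 1
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment

# Hypothetical reviewers; "e" starts in a committee that fits poorly.
vectors = {
    "a": {"parse": 1.0},
    "b": {"parse": 1.0, "syntax": 1.0},
    "c": {"dialog": 1.0},
    "d": {"dialog": 1.0, "discourse": 1.0},
    "e": {"parse": 1.0, "grammar": 1.0},
}
initial = {"a": "com2", "b": "com2", "c": "com1", "d": "com1", "e": "com1"}
result = balanced_kmeans(vectors, initial, capacity=3)
```

In this example the outlier "e" moves to the parsing committee while the remaining reviewers stay where they started.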
A third experiment was conducted to see if com- 
mittees could be reconstructed to better match the 
committee assignment of papers as proposed by 
the program chair. Specifically, we "reversed" the 
routing problem by computing committee centroids 
based on the set of submitted papers assigned to the 
committee by the program chair, and then routed 
the reviewers to the committees as if each reviewer 
was an abstract. In this case, we did not impose any 
restriction on committee size. The results are shown 
in Figure 2c. One can still see the original committees
in the new organization; the fact that the third
committee is large (21 reviewers, almost one third of
the whole population) can probably be explained by
the fact that the papers routed to committee 3 were
interdisciplinary and therefore had a lot in common
with many reviewers.
Another meaningful measure for clustering is the
$\mathrm{Sim}_{bib}(author_i, author_j)$ based on transitive
bibliographic citation and co-authorship (Section 3.4).
Figure 2d shows the results of applying maximum 
linkage agglomerative clustering to this similarity 
measure. This also shows some correlation with the 
manually chosen committees. 
Finally, it is readily noted by the human judges 
that certain committees (such as 3 and 4) were quite 
similar and difficult to distinguish. We can use ag- 
glomerative hierarchical clustering of our committee 
profile centroids to achieve some measure of relative 
committee distance. The following tree confirms hu- 
man intuition regarding committee similarity: 
[Dendrogram of committee centroids; leaves in order: com5, com2, com3, com4, com1, com6]
One application of this tree and associated dis- 
tances is to weight the cost of committee misassign- 
ments by the severity of the error. The majority of 
the system errors noted in Table 4 are between (3,4) 
and (1,6), which this empirical clustering would in- 
dicate are relatively low cost mistakes. 
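One way to realize severity-weighted scoring is to charge each misassignment the distance between the gold and the predicted committee. The sketch below is only an illustration: the distance values are invented placeholders rather than the distances from the tree above, and `weighted_error` is a hypothetical name.

```python
def weighted_error(pairs, distance):
    """Average misassignment cost over (gold, predicted) committee pairs.

    distance maps an unordered committee pair (a frozenset) to a cost;
    correct assignments cost nothing.
    """
    total = 0.0
    for gold, predicted in pairs:
        if gold != predicted:
            total += distance[frozenset((gold, predicted))]
    return total / len(pairs)

# Placeholder distances: similar committees are cheap to confuse.
toy_distance = {
    frozenset((3, 4)): 0.2,
    frozenset((1, 6)): 0.3,
    frozenset((2, 5)): 1.0,
}
toy_pairs = [(3, 4), (1, 1), (2, 5), (6, 1)]
toy_cost = weighted_error(toy_pairs, toy_distance)
```

Under such a scheme, confusing committees 3 and 4 contributes far less to the error than confusing two distant committees, whereas plain accuracy would count both equally.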
• bel[5]
• kurohashi[2] fillmore[5] fellbaum[5] grefenstette[5] calzolari[5] pirelli[5]
• tsujii[3] bangalore[2] manning[2] briscoe[3] charniak[4] bouma[2] johnson[2] lin[4]
• mohri[3] chen[4] matsumoto[3] wiebe[1] bruce[4] shin[3] lee[4] hang[5]
• tanaka[3] ng[4] daelemans[4] golding[4] roth[4] palmer[5] resnik[3] mckeown[6]
• isabelle[3] knight[3] fung[3] wu[4] ......[6]
• kaplan[2] cristea[1] satta[2] rogers[2] becker[2] weir[2]
• stone[1] dieugenio[1] asher[1] poesio[1] busa[5]
• rayner[3] mccoy[6] linden[6] paris[6] milosavljevic[6] oberlander[6] hahn[1] bateman[6] busemann[6] elhadad[6] rambow[6]
• steedman[2] green[1] moore[1] carberry[6] hirst[6] sidner[1] zukerman[6]
Figure 2a: Reviewer clusters obtained by average-linkage agglomerative clustering of reviewer papers
• kcom1 asher[1] busa[5] calzolari[5] cristea[1] dieugenio[1] hahn[1] paris[6] poesio[1] sidner[1] stone[1] wiebe[1]
• kcom2 bangalore[2] becker[2] bouma[2] johnson[2] lin[4] manning[2] mohri[3] rogers[2] satta[2] tsujii[3] weir[2]
• kcom3 briscoe[3] charniak[4] fung[3] isabelle[3] knight[3] matsumoto[3] mckeown[6] rayner[3] resnik[3] shin[3] wu[4]
• kcom4 bruce[4] chen[4] daelemans[4] golding[4] lee[4] ng[4] ratnaparkhi[4] roth[4]
• kcom5 bel[5] fellbaum[5] fillmore[5] grefenstette[5] hang[5] kaplan[2] kurohashi[2] palmer[5] pirelli[5] sos•re[6] tanaka[3]
• kcom6 bateman[6] busemann[6] carberry[6] elhadad[6] green[1] hirst[6] linden[6] mccoy[6] milosavljevic[6] moore[1] oberlander[6] rambow[6] steedman[2] zukerman[6]
Figure 2b: Committees obtained by k-means reclustering of initial committees
• rcom1 asher[1] cristea[1] dieugenio[1] green[1] moore[1] poesio[1] sidner[1] steedman[2] pirelli[5] carberry[6] hirst[6]
• rcom2 hahn[1] becker[2] bouma[2] johnson[2] manning[2] rogers[2] satta[2] weir[2] briscoe[3] tsujii[3] lin[4]
• rcom3 wiebe[1] bangalore[2] isabelle[3] knight[3] matsumoto[3] mohri[3] rayner[3] shin[3] bruce[4] chen[4] daelemans[4] golding[4] lee[4] ratnaparkhi[4]
• rcom4 fung[3] resnik[3] tanaka[3] charniak[4] ng[4] hang[5] mckeown[6]
• rcom5 kurohashi[2] fellbaum[5] fillmore[5] grefenstette[5]
• rcom6 kaplan[2] bel[5] busa[5] bateman[6] busemann[6] elhadad[6] linden[6] milosavljevic[6] oberlander[6] paris[6] rambow[6] zukerman[6]
Figure 2c: Committees obtained by reverse routing reviewers to the centroids of assigned papers
• kurohashi[2] mckeown[6] bel[5]
• milosavljevic[6] oberlander[6] bateman[6] ....[1] linden[6] paris[6] ....y[6] zukerman[6]
• isabelle[3] sos•re[6] elhadad[6] knight[3] busemann[6] rayner[3] busa[5] kaplan[2]
• tanaka[3] tsujii[3] manning[2] briscoe[3] pirelli[5] grefenstette[5] calzolari[5]
• lee[4] li[5] matsumoto[3] mohri[3] bangalore[2] ratnaparkhi[4] ng[4] wiebe[1] bruce[4] fung[3] wu[4] chang[3] golding[4] roth[4] chen[4]
• fellbaum[5] fillmore[5] resnik[3] satta[2] bouma[2] lin[4] johnson[2] charniak[4] rogers[2] weir[2] hahn[1] daelemans[4]
• poesio[1] hirst[6] green[1] carberry[6] palmer[5] sidner[1] steedman[2] stone[1]
• asher[1] cristea[1] dieugenio[1]
Figure 2d: Reviewer clusters based on agglomerative clustering using bibliographical similarity
6 System Usage and Confidence 
Measures for Routing 
The routing algorithms presented here have two nat- 
ural modes of application. The system's commit- 
tee recommendations can be used either for post- 
hoc routing error identification (as a sanity check) 
or for pre-hoc initial automatic assignment with hu-
man verification (see footnote 5). The latter strategy requires some
measure of system confidence for optimal applica- 
tion. Such a measure would help a human judge 
minimize the time spent in performing the task. If 
the system is very confident, one might even decide 
to accept the decision without careful review. On 
the other hand, in cases where the system is not 
confident, full attention is required. 
Footnote 5: The former strategy was actually employed in ACL'99
reviewing.
Based on the ranked output of the system, we
searched for feature transformations whose output
can be used in determining confidence intervals. A
reasonable one is $\delta = \frac{x_1 - x_2}{x_1}$, where $x_1$ and $x_2$
are the scores associated with the first and second
choices of the system. A plot of the averaged ac- 
curacy of this operator is depicted in Figure 1 (the 
value interval was divided into 10 equal and partially
overlapping bins and average accuracy was com- 
puted on each one of them). The graph on the 
right shows the accuracy in the case where ranking 
the gold standard as the system's second committee 
choice is not considered an error. 
One conclusion that can be drawn from the plots 
is that one can be relatively confident in the sys-
tem classification if the value of $\delta$ is above the 0.25
threshold, while $\delta < 0.1$ tends to indicate lowest ex-
pected accuracy and greatest need for careful human
inspection. Such confidence measures may also be
used in posthoc correction of human assignments to
rank the most likely human errors for re-inspection.
[Figure 1: two plots of average accuracy against the relative difference between the first 2 scores; the left plot scores the system's top choice only, and the right plot does not count ranking the gold standard as the system's second committee choice as an error.]
Figure 1: Measures of confidence
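The confidence operator and the two thresholds discussed above can be sketched in a few lines. The relative difference between the top two scores and the 0.25 and 0.1 thresholds come from the text; the function names, the three triage labels, and the assumption that scores are positive with higher meaning better are illustrative choices.

```python
def relative_margin(scores):
    """Relative difference between the best and second-best committee scores."""
    top, second = sorted(scores, reverse=True)[:2]
    return (top - second) / top  # assumes top > 0

def triage(scores, accept=0.25, inspect=0.1):
    """Map a ranked score list to a (hypothetical) review recommendation."""
    delta = relative_margin(scores)
    if delta > accept:
        return "confident"
    if delta < inspect:
        return "inspect carefully"
    return "normal review"

# Usage with invented committee scores.
margin = relative_margin([0.8, 0.4, 0.2])
close_call = triage([0.5, 0.48, 0.1])
clear_call = triage([0.9, 0.3, 0.2])
```

The same margin can be used to sort human assignments that disagree with a confident system choice, so that the most likely errors are re-inspected first.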
7 Human Performance and 
Consensus Generation 
Area committee routing is a difficult task for hu- 
mans. Table 12 shows the relatively low inter-judge 
agreement rates for the 4 judges mentioned in Sec-
tion 2.2 when annotating the 46-item primary test
set. Judge 1 (the program chair) had a slightly dif- 
ferent objective function for routing (including the 
avoidance of conflicts of interest and perhaps some 
committee size balancing), explaining slightly lower 
agreement rates than that between the two faculty 
members (Judges 2 and 3) who had the same task 
description of finding the most appropriate commit- 
tee without constraint. Judge 4 was a knowledgeable 
but less experienced third-year graduate student, and
his lower performance relative to his colleagues may 
have been due to more limited familiarity with the 
reviewers and their expertise. 
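Pairwise agreement of the kind reported in Table 12 is simply the percentage of items on which two judges chose the same committee. A minimal sketch, with hypothetical function names and invented toy assignments:

```python
def agreement(a, b):
    """Percentage of items on which two judges chose the same committee."""
    if len(a) != len(b) or not a:
        raise ValueError("judges must label the same non-empty item list")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def agreement_matrix(judges):
    """judges: dict mapping judge name to list of committee labels."""
    return {p: {q: agreement(judges[p], judges[q]) for q in judges}
            for p in judges}

# Invented labels for four items.
toy_judges = {
    "Judge1": [1, 2, 3, 4],
    "Judge2": [1, 2, 4, 4],
    "Judge3": [1, 5, 4, 4],
}
toy_matrix = agreement_matrix(toy_judges)
```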
In order to improve the quality of the gold stan- 
dard, a consensus standard was generated by taking 
the majority vote of Judges 1-3. In case of a tie, 
the program chair was used as the definitive assign- 
ment. In nearly 80% of the data, the consensus was 
identical to the program chair's assignment. 
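The consensus rule can be stated compactly in code. With three judges, a tie can only mean that all three disagree, in which case the program chair's label is used. The function name and toy labels are hypothetical.

```python
from collections import Counter

def consensus(chair, judge2, judge3):
    """Majority vote of three judges per item; the chair decides ties."""
    result = []
    for votes in zip(chair, judge2, judge3):
        label, count = Counter(votes).most_common(1)[0]
        # count == 1 means all three judges disagree: fall back to the chair.
        result.append(label if count >= 2 else votes[0])
    return result

toy_consensus = consensus([1, 2, 3], [1, 4, 5], [2, 4, 6])
```

In the toy example the chair is outvoted on the second item and decides the third, where no two judges agree.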
Table 13 illustrates the performance of Judge 4 
and the reference systems (Sections 3.1 and 3.4) for
both the Judge 1 and Consensus gold standards. 
Both Judge 4 and the System agreed substantially 
more with the consensus than the Judge 1 standard, 
providing some evidence for the relative merit of the 
consensus standard. The most interesting result, 
however, is that the system performed better than 
the graduate student Judge 4 for both standards (al- 
though generally lower than the performance of the 
more experienced faculty members). This suggests 
that the system, by virtue of its inherently
much greater familiarity with the publications and
hence the expertise of the reviewers, more than com-
pensates for its rather limited skills at generalization
and inference. This would suggest that the proposed
algorithm may be as effective as (or even more effec-
tive than) human paper routers, except for the most
knowledgeable human judges.
         Judge1  Judge2  Judge3  Judge4
Judge1   100     60.9    65.2    45.6
Judge2   60.9    100     73.9    47.8
Judge3   65.2    73.9    100     52.2
Judge4   45.6    47.8    52.2    100
Table 12: Human judge agreement (%)
The final observation is that in cases where there 
is high agreement among the human judges, system 
routing accuracy is also very high. Table 14 parti-
tions the data by thresholds of minimum agreement
among Judges 1-3, reporting results for the Section 3.1
system without the $Sim_{bib}$ extension. Given a certain
level of agreement (e.g. all 3 judges agree), it is also useful to
consider whether the 4th Judge agreed or not with
that consensus. By giving the less-experienced 4th
Judge an effective 1/2 vote, further refinement in
the granularity of consensus can be obtained with-
out affecting the primacy of the votes of Judges
1-3. In the 57% of the data where the first 3
judges all agree, system accuracy exceeds 80%. In the
most confidently classified 35% of the data where 
all 4 judges agree, system accuracy approaches 88% 
and in 100% of these cases the consensus commit- 
tee was one of the system's top two choices. These 
results strongly suggest that in the clear-cut cases 
where humans consistently agree on a classification, 
system performance is very reliable too. The large 
bulk of system "errors" are in cases where humans 
tend to disagree as well. 
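The agreement levels in Table 14 (1, 1.5, 2, ..., 3.5) are consistent with counting the votes of Judges 1-3 for the consensus committee and adding 1/2 when Judge 4 agrees. The sketch below follows that reading; the function names, the data layout, and the toy items are assumptions made for illustration.

```python
from collections import Counter

def consensus_and_level(j1, j2, j3, j4):
    """Consensus label of Judges 1-3 and its agreement level.

    The level is the number of Judges 1-3 backing the consensus,
    plus 0.5 if Judge 4 also agrees with it.
    """
    label, votes = Counter([j1, j2, j3]).most_common(1)[0]
    if votes < 2:  # three-way disagreement: the chair (Judge 1) decides
        label, votes = j1, 1
    return label, votes + (0.5 if j4 == label else 0.0)

def results_at_threshold(items, threshold):
    """items: (j1, j2, j3, j4, ranking) tuples; ranking is best-first."""
    kept = []
    for j1, j2, j3, j4, ranking in items:
        label, level = consensus_and_level(j1, j2, j3, j4)
        if level >= threshold:
            kept.append((label, ranking))
    if not kept:
        return None
    n = len(kept)
    return {
        "percent_of_data": 100.0 * n / len(items),
        "accuracy": 100.0 * sum(r[0] == g for g, r in kept) / n,
        "one_of_2_best": 100.0 * sum(g in r[:2] for g, r in kept) / n,
    }

# Invented items: four judge labels plus a system ranking of committees.
toy_items = [
    (1, 1, 1, 1, [1, 2]),
    (2, 2, 2, 3, [3, 2]),
    (4, 5, 5, 5, [5, 4]),
    (1, 2, 3, 2, [2, 1]),
]
toy_row = results_at_threshold(toy_items, 3)
```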
                            Judge1  Judge2  Judge3  Judge4  System 3.1  System 3.4
Judge1 assignment = Truth   100     60.9    65.2    45.6    56.5        63.0
Consensus = Truth           78.3    82.6    82.6    56.5    67.4        69.0
Table 13: Human judge and system agreement (%) with 2 gold standards
Minimum     % of    System     Average    One-of-2-best
Agreement   data    Accuracy   Position
1           100     67.4%      1.72       78.3%
1.5         98      68.8%      1.64       80.0%
2           87      75.0%      1.38       87.5%
2.5         72      78.8%      1.30       90.9%
3           57      80.8%      1.23       96.2%
3.5         35      87.5%      1.12       100.0%
Table 14: Routing results given levels of minimum human agreement on committee assignment
8 Conclusions 
This paper has presented and extensively evaluated 
a class of algorithms for automatic routing of sub- 
mitted papers to reviewers and area committees, 
without the need for any human annotation from 
the reviewers or the program chair. Routing is 
based on a profile of previous writings obtainable 
on-line for the reviewer pool, a generally stable and 
reusable resource that requires no manual adapta- 
tion for new submission streams. The paper ex- 
plored a wide set of variations and extensions on the 
core model, and system accuracy approaches or ex- 
ceeds that of human judges on the same task. This 
research demonstrates that such automated paper
routing techniques may have merit for future
conferences, especially those with rel-
atively large and diverse program committees where
it is difficult for one person to be familiar with the 
full range of expertise of all committee members. 
9 Acknowledgements 
Many thanks are owed to Ken Church and Robert 
Dale, ACL'99 Program Committee co-Chairs, the 
ACL'99 area chairs, and the participating review- 
ers and authors for their essential support of this 
project. It is hoped that insights gained in this fea- 
sibility study will be useful for future program chairs 
deciding if, where and how to utilize automatic rout- 
ing strategies in support of their committee and re- 
viewer assignment tasks. This research was partially 
supported by the National Science Foundation grant 
IRI-9618874. 

References 

S. Deerwester, S. Dumais, T. Landauer, G. Furnas,
and R. Harshman. 1990. Indexing by latent seman-
tic analysis. Journal of the American Society for
Information Science, 41(6):391-407.

Richard O. Duda and Peter E. Hart. 1973. Pattern 
Classification and Scene Analysis. John Wiley. 

Susan Dumais and Jakob Nielsen. 1992. Automat-
ing the assignment of submitted manuscripts to
reviewers. In Proceedings of SIGIR '92, pages 
233-244, Copenhagen, Denmark. 

Haym Hirsh. Personal communication. 

D. Hull. 1994. Improving text retrieval for the rout- 
ing problem using latent semantic indexing. In 
Proceedings of SIGIR '94, pages 282-291, New 
York. 

Anil K. Jain and Richard C. Dubes. 1988. Algo- 
rithms for Clustering Data. Prentice Hall. 

L. Larkey and W.B. Croft. 1996. Combining classi-
fiers in text categorization. In Proceedings of SI-
GIR '96.

D. Lewis and W. Gale. 1994. A sequential algo- 
rithm for training text classifiers. In Proceedings 
of the Seventeenth Annual International ACM- 
SIGIR Conference on Research and Development 
in Information Retrieval, pages 3-12, Dublin. 

Andrew McCallum, Kamal Nigam, Jason Rennie, 
and Kristie Seymore. 1999. Building domain- 
specific search engines with machine learning tech- 
niques. In Proceedings of the AAAI Spring Sym- 
posium on Intelligent Agents in Cyberspace. 

F. Mosteller and D. Wallace. 1964. Inference and 
Disputed Authorship: The Federalist. Addison- 
Wesley, Reading, Massachusetts. 

G. Salton and M. McGill. 1983. An Introduc- 
tion to Modern Information Retrieval. New York, 
McGraw-Hill. 

E. Voorhees and D. Harman. 1998. Overview of 
the 6th text retrieval conference (trec-6). In Pro- 
ceedings of the Sixth Text REtrieval Conference 
(TREC-6). NIST Special Publication, 500-240. 
