<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0627"> <Title>Taking the load off the conference chairs: towards a digital paper-routing assistant</Title> <Section position="3" start_page="220" end_page="221" type="metho"> <SectionTitle> 2 Task Description </SectionTitle> <Paragraph position="0"> The primary task investigated in this paper is the routing of full-length submitted conference papers to one of 6 area committees for ACL'99, with the committees ranked in order of appropriateness in Table 1 (actual output of the system on sample paper #185). The 6 committees are best defined by their members (listed with their committee numbers in Figure 2), but they are very roughly characterized in Table 1. A secondary task is to provide a proposed ordered list of appropriate reviewers, as shown in Table 2. Note that this list can be filtered to include just the first-choice committee, or can include the most appropriate reviewers independent of committee structure.</Paragraph> <Section position="1" start_page="220" end_page="221" type="sub_section"> <SectionTitle> 2.1 The Data </SectionTitle> <Paragraph position="0"> The evaluation data used in these experiments consisted of full-length articles submitted to the general session of ACL'99. Thematic session submissions were ignored because the reviewing committee was preselected by the author in these cases.</Paragraph> <Paragraph position="1"> The ACL'99 call for papers included a statement requesting that authors voluntarily submit electronic versions of their papers for a paper-routing experiment. Of the 180 general session authors, 51% (92) participated in the study through electronic submission.</Paragraph> <Paragraph position="2"> As noted above, electronic copies of representative papers were also solicited from members of the general session area program committees. Participants had the option of including a numeric ranking (1 to 10) indicating the representativeness of the papers with respect to their areas of expertise, but few chose to do so. In the numerous cases where no papers or an insufficient number of papers were received from reviewers, their self-selected sample of previous publications was augmented by large numbers of reviewer papers downloaded from cmp-lg (xxx.lanl.gov/cmp-lg), from their own home pages, and from the www.cora.jprc.com archive, a web search engine specialized in searching Computer Science related papers (see (McCallum et al., 1999)).</Paragraph> <Paragraph position="3"> Papers were received and processed in 5 acceptable formats: latex, postscript, plain text, portable document format (pdf) and html, all of which were converted to a marked-up plain text normal format. Distinct regions of the papers (title, abstract, main body, bibliography) were identified and extracted, when possible, in support of differential region weighting.</Paragraph> </Section> <Section position="2" start_page="221" end_page="221" type="sub_section"> <SectionTitle> 2.2 Evaluation Methodology </SectionTitle> <Paragraph position="0"> The primary &quot;gold standard&quot; for evaluation consisted of the committee numbers actually assigned to each paper by the ACL'99 program chair performing the committee routing. These judgments were obtained prior to his seeing the results of the automatic routing experiments. Because the program chair considered other factors, including potential conflicts of interest, in assigning papers, this is not a perfect annotation of the most appropriate committee based strictly on the mass of reviewer expertise.
Three other judges (2 NLP faculty members and one 3rd-year NLP graduate student) also routed those papers voluntarily submitted by the authors for the routing experiments, with their names, addresses and institutions stripped. Greatest committee appropriateness based on topic and reviewer expertise was the sole criterion for these paper assignments. A second evaluation gold standard was obtained from the weighted consensus of the 4 reviewers (described in Section 7 below).</Paragraph> <Paragraph position="1"> The 92 submitted papers were divided into two equal halves: a primary test set on which all major results were evaluated, and a secondary devtest set, via which some global parameters were estimated and the one instance of supervised training took place.</Paragraph> <Paragraph position="2"> Several evaluation measures were used to reflect system performance. The first is exact match classification accuracy (the percentage of the papers on which the gold standard and system agreed exactly on the committee assignment). Because the system returns a full preferred rank order of the 6 committees for all papers, a second natural performance measure is the average position of the truth (the gold standard committee selection) in this rank list. This measure gives an assessment of how many committees the human judge would have to consider, on average, before finding the correct classification; smaller is better. Because in many cases there are two equally viable committee contenders, a third measure, One-of-best-2, indicates the percentage of cases where the gold standard classification is in the top two choices ranked by the system. In many cases, the whole histogram is given, indicating the position of the gold standard classification in the system's committee ranking.</Paragraph> </Section> </Section> <Section position="4" start_page="221" end_page="225" type="metho"> <SectionTitle> 3 Routing Methodologies </SectionTitle> <Paragraph position="0"> There are numerous methods described in the information retrieval literature for article routing. Assuming that there are n classes and a set of m articles, the article routing task attributes each of the m articles to one of the n classes. It is clear that our task fits well in this paradigm; each paper has to go to one committee. The two major approaches tested in this work are the standard Salton-style vector space model (Salton and McGill, 1983) and the Naive Bayes classifier (Mosteller and Wallace, 1964). 2 These and several permutations and extensions are detailed and evaluated below.</Paragraph> <Section position="1" start_page="221" end_page="222" type="sub_section"> <SectionTitle> 3.1 Vector Routing Model </SectionTitle> <Paragraph position="0"> Unless we specify otherwise, we shall assume that the vocabulary $V$ is selected by removing a set of common (stop) words from the text. Both the submitted papers and the reviewer papers are represented in the space $[0, \infty)^{|V|}$ as vectors $D_i$: $D_{ij} = c_{ij} \cdot w_j$, where $c_{ij}$ is the count (the number of occurrences) of the $j$th word in document $D_i$ and $w_j$ is an &quot;importance&quot; weight associated with the $j$th word. One typical weighting function is IDF (Inverse Document Frequency): $w_j = \log \frac{N}{1 + docf_j}$, where $N$ is the total number of documents and $docf_j$ is the document frequency of the $j$th word (the number of documents the word appears in).</Paragraph>
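To make the representation concrete, the following minimal Python sketch (our own illustration, not the authors' implementation; the function and variable names are hypothetical) builds such IDF-weighted count vectors from tokenized documents:

    import math
    from collections import Counter

    def build_vectors(documents, stopwords):
        """documents: list of token lists. Returns sparse IDF-weighted count vectors."""
        docf = Counter()                       # document frequency of each non-stopword
        for doc in documents:
            for term in set(doc) - stopwords:
                docf[term] += 1
        n_docs = len(documents)
        # IDF weight w_j = log(N / (1 + docf_j)), as in the definition above
        idf = {t: math.log(n_docs / (1.0 + df)) for t, df in docf.items()}
        vectors = []
        for doc in documents:
            counts = Counter(t for t in doc if t not in stopwords)
            vectors.append({t: c * idf[t] for t, c in counts.items()})
        return vectors, idf

A dictionary per document keeps the vectors sparse, which matters here since the vocabulary is the full non-stopword lexicon of all submitted and reviewer papers.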
<Paragraph position="1"> One can measure the similarity between 2 documents by using the cosine similarity between their vector representations: $\cos(D_i, D_k) = \frac{D_i \cdot D_k}{\|D_i\|_2 \, \|D_k\|_2}$, i.e. the dot product of the normalized 3 vectors (see (Salton and McGill, 1983)). This measure of similarity yields values close to 1 for similar vectors and close to 0 for dissimilar ones.</Paragraph> <Paragraph position="3"> 2 Routing using these and other models is a central task in information retrieval, discussed in depth in (Hull, 1994), (Lewis and Gale, 1994), (Larkey and Croft, 1996), (Voorhees and Harman, 1998) and many other articles.</Paragraph> <Paragraph position="4"> 3 $\|\cdot\|_2$ being the Euclidean norm.</Paragraph> <Paragraph position="5"> The main algorithm proceeds as follows (a code sketch of these steps is given below, after Section 3.1.1): 1. For the $i$th reviewer ($i = 1, \ldots$), compute a centroid $R_i$ - a vector presumably associated with the main research interests of the reviewer: $R_i = \sum_{p \in \mathcal{P}_i} D_p$, where $\mathcal{P}_i$ is the set of papers associated with reviewer $i$ (a given word might be weighted differently in different regions - see region weighting below). 2. For each committee $k$, compute its centroid as the sum of the composing reviewers' centroids: $C_k = \sum_{i \in \mathcal{C}_k} R_i$, where $\mathcal{C}_k$ is the pool of reviewers for committee $k$. 3. For each paper, rank all the committees based on the cosine similarity between the paper's vector and the committee centroids - the one that ranks highest is chosen as the classification of the paper: $\mathrm{committee}(P) = \arg\max_k \cos(D_P, C_k)$.</Paragraph> <Paragraph position="13"> Table 3 gives results for several basic baseline models. Section 2.2 describes the evaluation measures.</Paragraph> <Paragraph position="14"> Clearly different regions of a paper have different importance in determining its semantic content. We automatically separate the text into title, abstract, keywords, body and bibliography regions and investigate different weighting parameters for these regions.</Paragraph> <Paragraph position="15"> The results for full text and region weighting are given in Table 5. Consensus evaluation is described in Section 7. A confusion matrix showing region weighting results is given in Table 4. Note that the primary confusion is between the difficult-to-distinguish committees 3 and 4.</Paragraph> <Paragraph position="16"> The remainder of this section describes the modifications made to this model, the results we obtained, and conclusions and explanations of the results.</Paragraph> </Section> <Section position="2" start_page="222" end_page="223" type="sub_section"> <SectionTitle> 3.1.1 Weighting Paper Sources Differently </SectionTitle> <Paragraph position="0"> As noted before, the reviewers' papers were obtained from different sources, with potentially different relative indicativeness of a reviewer's expertise. A variety of relative weighting parameters for these sources were explored on the devtest set. None yielded a significant improvement over the equally weighted model.</Paragraph> <Paragraph position="1"> Experiments were also conducted to test the efficacy of two variants of IDF (based on the concepts of 1 document per reviewer and 1 document per committee), entropy-based term weighting, and the use of stemming.</Paragraph> </Section>
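The following sketch (ours; it assumes the sparse IDF-weighted vectors built earlier and hypothetical data layouts for committees and reviewer papers) illustrates steps 1-3 of the centroid-based routing above:

    import math

    def cosine(u, v):
        """Cosine similarity between two sparse vectors (dicts: term -> weight)."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def add_into(target, vec):
        for t, w in vec.items():
            target[t] = target.get(t, 0.0) + w

    def rank_committees(paper_vec, committees, reviewer_papers):
        """committees: dict committee_id -> list of reviewer ids.
        reviewer_papers: dict reviewer id -> list of that reviewer's paper vectors."""
        centroids = {}
        for cid, reviewers in committees.items():
            centroid = {}
            for rid in reviewers:            # step 1: each reviewer's paper vectors,
                for vec in reviewer_papers[rid]:
                    add_into(centroid, vec)  # step 2: summed into the committee centroid
            centroids[cid] = centroid
        # step 3: rank committees by cosine similarity to the submitted paper
        return sorted(centroids, key=lambda cid: cosine(paper_vec, centroids[cid]),
                      reverse=True)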
<Section position="3" start_page="223" end_page="223" type="sub_section"> <SectionTitle> 3.2 Naive Bayes Classifier </SectionTitle> <Paragraph position="0"> The naive Bayes model makes an independence assumption among the words in a text. It chooses the committee $C_j$ that maximizes the probability $P(C_j \mid P_i) \propto P(C_j) \prod_{w_k \in P_i} P(w_k \mid C_j)$; furthermore, if one assumes equal a priori probability on the committees, one looks for $\arg\max_j \sum_{w_k \in P_i} \log P(w_k \mid C_j)$, where the words $w_k$ are the target words in the article $P_i$ (usually all the non-stopwords). One of the issues that need to be addressed when considering naive Bayes approaches is smoothing. One cannot afford to have null probabilities, as they would simply nullify the results. The smoothing method used in this approach is simple additive smoothing, which adjusts the maximum likelihood estimates as follows: $P(w_k \mid C_j) = \frac{C(w_k, C_j) + \delta}{N(C_j) + \delta \, |V|}$, where $C(w_k, C_j)$ is the count of word $w_k$ in the papers associated with committee $C_j$, $N(C_j) = \sum_k C(w_k, C_j)$, and $V$ is the whole vocabulary. This is a very simple strategy, but we believe that it works relatively well for unigrams. Results are shown in Table 7; the model underperforms the region-weighted vector-based model with similar parameters.</Paragraph> <Paragraph position="1"> To check whether unseen words are a problem in our case, we varied the parameter $\delta$. Since the results were almost the same for $\delta$ values ranging from 0.01 to 1, we conclude that more sophisticated smoothing methods (e.g. Good-Turing, Kneser-Ney) would not have made a difference, either.</Paragraph> </Section>
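Under the assumptions above (equal committee priors, additive smoothing with parameter $\delta$), the naive Bayes router of Section 3.2 reduces to a few lines; this sketch is our own, not the authors' code:

    import math
    from collections import Counter

    def train_nb(committee_docs):
        """committee_docs: dict committee_id -> list of token lists (reviewer papers)."""
        vocab = {t for docs in committee_docs.values() for doc in docs for t in doc}
        counts = {cid: Counter(t for doc in docs for t in doc)
                  for cid, docs in committee_docs.items()}
        return counts, vocab

    def route_nb(paper_tokens, counts, vocab, delta=0.1):
        """Rank committees by the sum of log additive-smoothed word probabilities."""
        scores = {}
        for cid, c in counts.items():
            denom = sum(c.values()) + delta * len(vocab)
            scores[cid] = sum(math.log((c[t] + delta) / denom)
                              for t in paper_tokens if t in vocab)
        return sorted(scores, key=scores.get, reverse=True)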
<Section position="4" start_page="223" end_page="224" type="sub_section"> <SectionTitle> 3.3 Voting </SectionTitle> <Paragraph position="0"> As an alternative to the top-down hierarchical routing strategy, we investigated assigning papers directly to individual reviewers first, and then letting the top k reviewers vote for their own committees. Although optimal performance here was slightly lower than for the reference system (46.5%, 2.22), the gold standard is based on the primacy of human committee assignments, and there is no guarantee that the chosen committee has an adequate number of well-qualified reviewers. Without the ability for cross-committee reviewing, a committee with 3 moderately well-qualified reviewers would probably be preferable to a committee with only a single qualified reviewer, even one with extremely strong expertise.</Paragraph> </Section> <Section position="5" start_page="224" end_page="225" type="sub_section"> <SectionTitle> 3.4 Routing based on (transitive) bibliographic similarity </SectionTitle> <Paragraph position="0"> Appropriate reviewers for a paper can often be determined through analysis of the paper's bibliography. Clearly direct citation of a potential reviewer is partial evidence of that person's suitability to review the paper. This relation is also somewhat transitive, as the authors who cite or are cited by an author directly cited in the paper also have increased likelihood of being relevant reviewers.</Paragraph> <Paragraph position="1"> The goal of this section is to identify transitively related authors via chains of the bibliographic relations Cites(author_i, author_j) and Coauthor(author_i, author_j). To estimate these relations, we automatically extracted and normalized bibliographic citations from a large body of on-line texts including all of the reviewer-submitted papers.</Paragraph> <Paragraph position="2"> Via transitive use of this extensive citation data, reviewer-paper similarity could be estimated even when there was no direct mention of the reviewer in the text to be routed.</Paragraph> <Paragraph position="3"> To formalize this approach, let us assume that there exists an indexed set of authors $\mathcal{A} = \{a_1, \ldots, a_{n_a}\}$. The reviewers are part of this set; let $\mathcal{R} = \{r_1, \ldots, r_{n_r}\}$ denote the set of reviewers. We also have available a set $\mathcal{P}$ of papers submitted by reviewers, from which we estimate $Cites(a_i, a_j) = \sum_{p \in \mathcal{P}} N_p(a_i, a_j)$ and $Coauthor(a_i, a_j) = N_c(a_i, a_j)$, where $N_p(a_i, a_j)$ is the number of times $a_j$ was cited in the paper $p$ if $a_i$ is an author of $p$, and 0 otherwise, and $N_c(a_i, a_j)$ is the number of papers in which $a_i$ and $a_j$ were coauthors, identified either from the head of a paper $p \in \mathcal{P}$ or from a bibliographic citation extracted from $p$. The relation Cited_by can be captured by the transposition of the citation matrix, $Cites^T$. A symmetric similarity matrix combining these base relations is defined as: $Sim^1 = \lambda \, (Cites + Cites^T) + (1 - \lambda) \, Coauthor$,   (4) where $\lambda$ is a weighting factor between the contributing sources of similarity. The index 1 ($Sim^1$) denotes &quot;direct&quot; (non-transitive) bibliographic similarity. We enforced that $Sim^1_{ii} = 1$ for all authors $i$.</Paragraph> <Paragraph position="4"> The submitted articles $P_1, \ldots, P_{n_p}$ were routed to committees based on similarities between the authors cited in the paper and the reviewers forming a committee: $Sim(P_l, C_k) = \sum_{r \in C_k} \; \sum_{a \text{ cited in } P_l} Sim^1(a, r)$.   (5) A paper is routed to the committee that maximizes the paper/committee similarity given in (5). Tuning the parameter $\lambda$ on the training set yielded $\lambda = 0.8$.</Paragraph> <Paragraph position="5"> The similarity relation computed in formula (4) is very sparse, as a large number of values are 0. To compute a more robust similarity, one can consider the transitive closure of the graph defined by $Sim^1$. The weights in the resulting graph, $Sim^\infty(i, j)$, give the transitive similarity between the $i$th and $j$th authors, accumulated over the paths connecting them. The similarity along one path could be any function of the weights of the composing links; the one we considered is the product of the link weights: $Sim^\infty(i, j) = \sum_{\pi: i \rightarrow j} \; \prod_{(k, l) \in \pi} Sim^1(k, l)$.   (6) Computing the values in (6) proves to be computationally expensive, and it appears that extending the transitive similarity relationship indefinitely may become counterproductive. Therefore, we limited the length of the paths involved in computing formula (6): $Sim^d(i, j) = \sum_{\pi: i \rightarrow j, \, |\pi| \le d} \; \prod_{(k, l) \in \pi} Sim^1(k, l)$. Let us observe that $Sim^\infty(i, j) = \lim_{d \to \infty} Sim^d(i, j)$, hence the name. In Table 9 (routing performance as a function of the parameter d, with $\lambda = 0.8$ and $\beta = 1$, evaluated on devtest data), one can observe that the routing performance increases as d increases, up through a transitive distance of 3, with mixed results beyond that point.</Paragraph> <Paragraph position="6"> Section 3.4 has, until now, described a routing similarity based only on transitive bibliographic citation and co-authorship ($Sim_{bib}$). However, routing a paper solely on this basis is not optimal, as it ignores similarity between the terms in the full text ($Sim_{text}$), as described in Section 3.1 using region weighting. We combined these two measures as $Sim_{combined} = \beta \, Sim_{bib} + (1 - \beta) \, Sim_{text}$. On the training set, a value of $\beta = 0.76$ was found to maximize performance, for $d = 3$ and the previously fixed $\lambda = 0.8$.</Paragraph> <Paragraph position="7"> The full evaluation of the transitive bibliographic similarity measure is given in Table 8. Performance using exclusively $Sim_{bib}$ ($\beta = 1$) is considerably lower (39.1%) than the previous best text-based similarity ($Sim_{text}$) performance of 56.5% exact match accuracy. However, combining the two evidence sources yields a substantially higher routing accuracy of 63.0%. This result is also observed when evaluating on the consensus gold standard described in Section 7, where the combined model accuracy of 69.0% exceeds the $Sim_{text}$-only accuracy of 67.4%. As shall be shown, for both evaluation standards the combined system accuracy rivals that of several human judges.</Paragraph>
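A sketch of the bounded-depth transitive similarity and its combination with the text score, under the reconstruction above, might look as follows (our own illustration; matrix powers stand in for explicit path enumeration, so walks that revisit nodes are included):

    import numpy as np

    def transitive_similarity(sim1, d=3):
        """sim1: symmetric author-by-author matrix Sim^1.
        Sums the products of link weights over walks of length <= d."""
        total = np.zeros_like(sim1, dtype=float)
        power = np.eye(sim1.shape[0])
        for _ in range(d):
            power = power @ sim1
            total += power
        np.fill_diagonal(total, 1.0)   # enforce self-similarity of 1
        return total

    def combined_score(sim_bib_score, sim_text_score, beta=0.76):
        """Linear combination of bibliographic and full-text committee scores."""
        return beta * sim_bib_score + (1.0 - beta) * sim_text_score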
<Paragraph position="8"> Prior to now, we have ignored a submitted paper's author(s) when making the routing decision. However, ACL'99 reviewing was not blind, and an interesting question is what routing performance can be achieved when classification is based exclusively on the authors' identity. Using only $Sim_{bib}(author_i, reviewer_j)$ for the paper's author(s), exact match accuracy while completely ignoring the text of the submitted paper (52.5%) approaches the accuracy obtained using only the submitted text (56.5%), as shown in Table 10. This suggests that an author's identity alone is largely sufficient for routing the paper to the committee most appropriate for evaluating her or his work.</Paragraph> </Section> </Section> <Section position="5" start_page="225" end_page="226" type="metho"> <SectionTitle> 4 Supervised Learning </SectionTitle> <Paragraph position="0"> The algorithms presented so far are unsupervised; the only use for labeled data in the devtest was for global parameter optimization. This is a strength of the approach presented here, because it can be used successfully without any human annotation. In this section, we tested the efficacy of training supervised models based on initial program chair annotation of a portion of the submitted papers. Models of the types of papers initially assigned to each committee can help select further papers appropriate for that committee. Using the vector model, we can define, for each committee, the centroid $D_i$ of the papers initially routed to that committee, in the same manner as the centroids of Section 3.1: $D_{ij} = \sum_{P_k \in C_i} c(w_j, P_k)$, where $c(w_j, P_k)$ is the count of the $j$th word in paper $P_k$. Rather than use these models in isolation, we combine them with the previously described reviewer-based committee centroids $C_{ij}$ into $C'_{ij} = C_{ij} + \lambda \cdot D_{ij}$, where the parameter $\lambda$ was optimized on the devtest set to be 3. The results are presented in Table 11, and outperform the simple unsupervised model 60.9% to 56.5%, given initial program chair annotation of 1/2 of the data (the devtest set).</Paragraph> <Paragraph position="1"> The updates to the base centroids were made off-line in our method; however, this is not required: once a routing decision is made for a new paper, its &quot;true&quot; label can be used to update the corresponding centroid online. There are numerous methods that could be borrowed from AI and IR to implement this strategy, including Active Learning (Lewis and Gale, 1994). Such online adaptation can maximally leverage program chair feedback and minimize the need for initial tagged training data.</Paragraph> </Section>
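The centroid update of Section 4, including the online variant just described, can be sketched as follows (illustrative names; the sparse-dictionary data layout matches the vectors used earlier and is our assumption):

    def update_committee_centroid(reviewer_centroid, paper_centroid, lam=3.0):
        """Combine the reviewer-derived centroid with the centroid of papers the
        program chair has already routed to this committee: C' = C + lam * D."""
        combined = dict(reviewer_centroid)
        for term, weight in paper_centroid.items():
            combined[term] = combined.get(term, 0.0) + lam * weight
        return combined

    def online_update(paper_centroid, routed_paper_vec):
        """Once a newly routed paper's label is confirmed, fold its vector into D."""
        for term, weight in routed_paper_vec.items():
            paper_centroid[term] = paper_centroid.get(term, 0.0) + weight
        return paper_centroid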
<Section position="6" start_page="226" end_page="228" type="metho"> <SectionTitle> 5 Automatic Area Committee Generation </SectionTitle> <Paragraph position="0"> In a hierarchical routing system, clearly the composition of the committees is crucial. Suboptimal results are achieved if the 3 most appropriate reviewers for a paper are spread out over different committees.</Paragraph> <Paragraph position="1"> As an experiment to see if the committee organization could possibly be improved, we investigated committee structures empirically using several clustering strategies.</Paragraph> <Paragraph position="2"> In the first test, we generated a hierarchical agglomerative clustering of the entire reviewer set based on the pairwise cosine similarity between their publication vectors, using maximal linkage clustering (Duda and Hart, 1973; Jain and Dubes, 1988). The results are given in Figure 2a, showing the full tree and extracted cluster list. The numbers in brackets indicate the actual committee assignment of the reviewers; basic inspection will indicate that the derived clusters correspond closely to existing committee compositions (although this information was completely ignored in the clustering process). Analysis of the substructure in the tree shows a natural sub-clustering by research subfocus (e.g. ((isabelle (knight (fung wu))) somers)). Inspection will also show that people with close research focus are spread out among 3 or more different committees, raising some doubts about the optimality of any committee-based routing process.</Paragraph> <Paragraph position="3"> In another experiment, we tested the extent to which committees could more productively be re-formed by beginning with the initial committee centroids and redistributing the reviewers using K-means clustering. We used a modified version of K-means to obtain reviewer groups that are balanced in size, similar to the original committees. This was done by limiting the class size to the maximum number of reviewers in an original class; the starting point of the algorithm was based on the original committees. The resulting clusters are shown in Figure 2b. The basic initial committee composition is preserved, with some outliers reassigned.</Paragraph> <Paragraph position="4"> A third experiment was conducted to see if committees could be reconstructed to better match the committee assignment of papers as proposed by the program chair. Specifically, we &quot;reversed&quot; the routing problem by computing committee centroids based on the set of submitted papers assigned to the committee by the program chair, and then routed the reviewers to the committees as if each reviewer was an abstract. In this case, we did not impose any restriction on committee size. The results are shown in Figure 2c. One can still see the original committees in the new organization; the fact that the third committee is large (21 reviewers, almost one third of the whole population) can probably be explained by the fact that the papers routed to committee 3 were interdisciplinary, and therefore had a lot in common with many reviewers.</Paragraph> <Paragraph position="5"> Another meaningful measure for clustering is $Sim_{bib}(author_i, author_j)$, based on transitive bibliographic citation and co-authorship (Section 3.4).</Paragraph> <Paragraph position="6"> Figure 2d shows the results of applying maximum linkage agglomerative clustering to this similarity measure. This also shows some correlation with the manually chosen committees.</Paragraph>
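The clustering experiments above can be reproduced with standard tools; this sketch (ours, assuming scipy is available) applies complete-linkage ('maximal linkage') agglomerative clustering to cosine distances between reviewer publication vectors:

    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    def cluster_reviewers(reviewer_matrix, n_clusters=6):
        """reviewer_matrix: one dense row per reviewer (e.g. the centroid R_i).
        Returns the dendrogram linkage and a flat assignment into n_clusters groups."""
        distances = pdist(reviewer_matrix, metric='cosine')   # 1 - cosine similarity
        tree = linkage(distances, method='complete')          # maximal linkage
        labels = fcluster(tree, t=n_clusters, criterion='maxclust')
        return tree, labels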
<Paragraph position="7"> Finally, it is readily noted by the human judges that certain committees (such as 3 and 4) were quite similar and difficult to distinguish. We can use agglomerative hierarchical clustering of our committee profile centroids to obtain some measure of relative committee distance. The resulting tree confirms human intuition regarding committee similarity. One application of this tree and its associated distances is to weight the cost of committee misassignments by the severity of the error. The majority of the system errors noted in Table 4 are between committees (3,4) and (1,6), which this empirical clustering would indicate are relatively low-cost mistakes.</Paragraph> <Section position="1" start_page="227" end_page="228" type="sub_section"> <SectionTitle> 6 Confidence Measures for Routing </SectionTitle> <Paragraph position="0"> The routing algorithms presented here have two natural modes of application. The system's committee recommendations can be used either for post-hoc routing error identification (as a sanity check) or for pre-hoc initial automatic assignment with human verification. 5 The latter strategy requires some measure of system confidence for optimal application. Such a measure would help a human judge minimize the time spent in performing the task. If the system is very confident, one might even decide to accept the decision without careful review. On the other hand, in cases where the system is not confident, full attention is required.</Paragraph> <Paragraph position="1"> 5 The former strategy was actually employed in ACL'99 reviewing.</Paragraph> <Paragraph position="2"> Based on the ranked output of the system, we searched for feature transformations whose output can be used in determining confidence intervals. A reasonable one is $\delta = \frac{x_1 - x_2}{x_1}$, where $x_1$ and $x_2$ are the scores associated with the first and second choices of the system. A plot of the averaged accuracy of this operator is depicted in Figure 1 (the value interval was divided into 10 equal and partially overlapping bins and average accuracy was computed on each of them). The graph on the right shows the accuracy in the case where ranking the gold standard as the system's second committee choice is not considered an error.</Paragraph> <Paragraph position="3"> One conclusion that can be drawn from the plots is that one can be relatively confident in the system classification if the value of $\delta$ is above a 0.25 threshold, while $\delta < 0.1$ tends to indicate the lowest expected accuracy and the greatest need for careful human inspection. Such confidence measures may also be used in post-hoc correction of human assignments, to rank the most likely human errors for re-inspection.</Paragraph> </Section> </Section>
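The confidence heuristic itself is trivial to compute from the ranked committee scores; the threshold policy below is only an illustration of the 0.25 and 0.1 cut-offs reported above, not part of the original system:

    def routing_confidence(ranked_scores):
        """delta = (x1 - x2) / x1, from the two best committee scores."""
        x1, x2 = ranked_scores[0], ranked_scores[1]
        return (x1 - x2) / x1 if x1 else 0.0

    def review_priority(delta):
        """Map confidence onto a suggested level of human attention."""
        if delta > 0.25:
            return "light-review"       # system relatively confident
        if delta < 0.10:
            return "full-human-review"  # lowest expected accuracy
        return "standard-review"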
<Section position="7" start_page="228" end_page="229" type="metho"> <SectionTitle> 7 Human Performance and Consensus Generation </SectionTitle> <Paragraph position="0"> Area committee routing is a difficult task for humans. Table 12 shows the relatively low inter-judge agreement rates for the 4 judges mentioned in Section 2.2 when annotating the 46-paper primary test set. Judge 1 (the program chair) had a slightly different objective function for routing (including the avoidance of conflicts of interest and perhaps some committee size balancing), explaining slightly lower agreement rates than those between the two faculty members (Judges 2 and 3), who had the same task description of finding the most appropriate committee without constraint. Judge 4 was a knowledgeable but less experienced 3rd-year graduate student, and his lower performance relative to his colleagues may have been due to more limited familiarity with the reviewers and their expertise.</Paragraph> <Paragraph position="1"> In order to improve the quality of the gold standard, a consensus standard was generated by taking the majority vote of Judges 1-3. In case of a tie, the program chair's choice was used as the definitive assignment. In nearly 80% of the data, the consensus was identical to the program chair's assignment.</Paragraph> <Paragraph position="2"> Table 13 illustrates the performance of Judge 4 and the reference systems (Sections 3.1 and 3.4) against both the Judge 1 and consensus gold standards.</Paragraph> <Paragraph position="3"> Both Judge 4 and the system agreed substantially more with the consensus than with the Judge 1 standard, providing some evidence for the relative merit of the consensus standard. The most interesting result, however, is that the system performed better than the graduate student Judge 4 for both standards (although generally lower than the performance of the more experienced faculty members). This suggests that the system's inherently much greater familiarity with the publications, and hence the expertise, of the reviewers more than compensates for its rather limited skills at generalization and inference. This would suggest that the proposed algorithm may be as effective as (or even more effective than) human paper routers, except for the most knowledgeable human judges.</Paragraph> <Paragraph position="4"> The final observation is that in cases where there is high agreement among the human judges, system routing accuracy is also very high. Table 14 divides the data by thresholds of minimum agreement among Judges 1-3 as the primary partitioning principle, using the Section 3.1 system without the $Sim_{bib}$ extension. Given a certain level of agreement (e.g. all 3 judges agree), it is also useful to consider whether the 4th judge agreed with that consensus. By giving the less experienced 4th judge an effective 1/2 vote, further refinement in the granularity of consensus can be obtained without affecting the primacy of the votes of Judges 1-3. In the 57% of the data where only the first 3 judges agree, system accuracy exceeds 80%. In the most confidently classified 35% of the data, where all 4 judges agree, system accuracy approaches 88%, and in 100% of these cases the consensus committee was one of the system's top two choices. These results strongly suggest that in the clear-cut cases where humans consistently agree on a classification, system performance is very reliable too. The large bulk of system &quot;errors&quot; are in cases where humans tend to disagree as well.</Paragraph> </Section> </Paper>