Proceedings of the 
1999 Joint SIGDAT Conference 
on 
Empirical Methods in Natural 
Language Processing 
and 
Very Large Corpora 
Sponsored by 
The Association for Computational Linguistics 
SIGDAT 
LEXIS-NEXIS, a Division of Reed Elsevier, Inc. 
Hong Kong University of Science and Technology 
Edited by 
Pascale Fung 
and 
Joe Zhou 
21-22 June 1999 
University of Maryland 
College Park, MD, USA 

© 1999, Association for Computational Linguistics 
Order additional copies from: 
Association for Computational Linguistics 
75 Paterson Street, Suite 9 
New Brunswick, NJ 08901 USA 
+1-732-342-9100 phone 
+1-732-342-9339 fax 
acl@aclweb.org 
SPONSORS: 
The Association for Computational Linguistics (ACL) 
SIGDAT (ACL's SIG for Linguistic Data and Corpus-based Approaches to NLP) 
LEXIS-NEXIS, a Division of Reed Elsevier, Inc. 
Hong Kong University of Science and Technology, Human Language Technology Center 
INVITED SPEAKERS: 
Kenneth W. Church (AT&T Labs-Research) 
Richard Schwartz (BBN Technologies) 
ORGANIZERS: 
Pascale Fung, Chair 
Joe Zhou, Co-chair 
PROGRAM COMMITTEE: 
Jing-Shin Chang (Behavior Design Corp.) 
Ken Church (AT&T Labs-Research) 
Ido Dagan (Bar-Ilan University) 
Marti Hearst (UC-Berkeley) 
Huang, Changning (Microsoft Research China) 
Pierre Isabelle (Xerox Research Europe) 
Lillian Lee (Cornell University) 
David Lewis (AT&T Labs-Research) 
Dan Melamed (West Group) 
Mehryar Mohri (AT&T Labs-Research) 
Masaaki Nagata (NTT) 
Richard Sproat (AT&T Labs-Research) 
Andreas Stolcke (SRI International) 
Ralph Weischedel (BBN) 
Dekai Wu (HKUST) 
David Yarowsky (Johns Hopkins University) 
ADDITIONAL REVIEWERS: 
Srinivas Bangalore (AT&T Labs-Research) 
Rebecca Bruce (Univ. of North Carolina) 
Michael Collins (AT&T Labs-Research) 
Gregory Grefenstette (Xerox Research Europe) 
Vasileios Hatzivassiloglou (Columbia University) 
David Hull (Xerox Research Europe) 
Peter Jackson (West Group) 
Christian Jacquemin (LIMSI) 
Liu, Xiaohu (HKUST) 
Sung Hyon Myaeng (Chungnam National Univ.) 
Shimei Pan (Columbia University) 
Ted Pedersen (Cal Poly) 
Roberto Pieraccini (AT&T Labs-Research) 
Ellen Riloff (University of Utah) 
Hinrich Schütze (Xerox PARC) 
Yannis Stylianou (AT&T Labs-Research) 
Zhao, Jun (HKUST) 
FURTHER INFORMATION: 
Pascale Fung 
Human Language Technology Center 
Department of Electrical and Electronic Engineering 
University of Science and Technology (HKUST) 
Clear Water Bay, Kowloon 
Hong Kong 
Email: pascale@ee.ust.hk 
Joe Zhou 
LEXIS-NEXIS, a Division of Reed Elsevier 
9555 Springboro Pike 
Dayton, OH 45342 
USA 
Email: joez@lexis-nexis.com 
CONFERENCE PROGRAM 
Monday, June 21 
8:45-9:00 Welcome 
9:00-9:40 INVITED SPEECH 
What's Happened Since the First SIGDAT Meeting? 
Kenneth W. Church (AT&T Labs-Research) 
9:40-9:50 Short Break 
9:50-10:10 Text-Translation Alignment: Three Languages are Better than Two 
Michel Simard 
10:10-10:30 Mapping Multilingual Hierarchies Using Relaxation Labeling 
J. Daudé, L. Padró and G. Rigau 
10:30-10:50 Improved Alignment Models for Statistical Machine Translation 
Franz Josef Och, Christoph Tillmann, and Hermann Ney 
10:50-11:10 Cross-Language Information Retrieval for Technical Documents 
Atsushi Fujii and Tetsuya Ishikawa 
11:10-11:30 Break 
11:30-11:50 Boosting Applied to Tagging and PP Attachment 
Steven Abney, Robert E. Schapire and Yoram Singer 
11:50-12:10 Applying Extrasentential Context to Maximum Entropy Based Tagging with 
a Large Semantic and Syntactic Tagset 
Ezra Black, Andrew Finch and Ruiqiang Zhang 
12:10-12:30 Improving POS Tagging Using Machine-Learning Techniques 
Lluís Màrquez, Horacio Rodríguez, Josep Carmona and Josep Montolio 
12:30-14:00 LUNCH 
14:00-14:20 Determining the Specificity of Nouns From Text 
Sharon A. Caraballo and Eugene Charniak 
14:20-14:40 Retrieving Collocations From Korean Text 
Seonho Kim, Zooil Yang, Mansuk Song and Jung-Ho Ahn 
14:40-15:00 Noun Phrase Coreference as Clustering 
Claire Cardie and Kiri Wagstaff 
15:00-15:20 Break 
15:20-15:40 Language Independent Named Entity Recognition Combining Morphological and 
Contextual Evidence 
Silviu Cucerzan and David Yarowsky 
15:40-16:00 Unsupervised Models for Named Entity Classification 
Michael Collins and Yoram Singer 
16:00-16:20 Hybrid Disambiguation of Prepositional Phrase Attachment and Interpretation 
Sven Hartrumpf 
16:20-16:40 HMM Specialization with Selective Lexicalization 
Jin-Dong Kim, Sang-Zoo Lee and Hae-Chang Rim 
Tuesday, June 22 
9:00-9:40 INVITED SPEECH 
Why Doesn't Natural Language Come Naturally? 
Richard Schwartz (BBN Technologies) 
9:40-9:50 Short Break 
9:50-10:10 POS Tags and Decision Trees for Language Modeling 
Peter A. Heeman 
10:10-10:30 An Information-Theoretic Empirical Analysis of Dependency-Based Feature 
Types for Word Prediction Models 
Dekai Wu, Zhao Jun and Sui Zhifang 
10:30-10:50 Word Informativeness and Automatic Pitch Accent Modeling 
Shimei Pan and Kathleen McKeown 
10:50-11:10 Learning Discourse Relations with Active Data Selection 
Tadashi Nomoto and Yuji Matsumoto 
11:10-11:30 Break 
11:30-11:50 A Learning Approach to Shallow Parsing 
Marcia Muñoz, Vasin Punyakanok, Dan Roth and Dav Zimak 
11:50-12:10 Guiding a Well-Founded Parser with Corpus Statistics 
Amon Seagull and Lenhart Schubert 
12:10-12:30 Exploiting Diversity in Natural Language Processing: Combining Parsers 
John Henderson and Eric Brill 
12:30-14:00 LUNCH 
14:00-15:10 Panel Discussion 
The Future of Language Technologies: Research, Development and Marketing 
Ken Church (AT&T), Pierre Isabelle (Xerox Europe), Roberto Pieraccini (AT&T), 
John Rausch (Lexis-Nexis), Keh-Yih Su (Behavior Design Corp.), Raphael Wong (Intel) 
15:10-15:20 Short Break 
15:20-15:40 Lexical Ambiguity and Information Retrieval Revisited 
Julio Gonzalo, Anselmo Peñas and Felisa Verdejo 
15:40-16:00 Detecting Text Similarity over Short Passages: Exploring Linguistic Feature 
Combinations via Machine Learning 
Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin 
16:00-16:20 Automated Construction of Weighted String Similarity Measures 
Jörg Tiedemann 
16:20-16:40 Taking the Load Off the Conference Chairs: Towards a Digital Paper Routing Assistant 
David Yarowsky and Radu Florian 

TABLE OF CONTENTS 
What's Happened Since the First SIGDAT Meeting? (INVITED TALK) 
Kenneth Ward Church ................................................................ 1 
Text-Translation Alignment: Three Languages are Better than Two 
Michel Simard ...................................................................... 2 
Mapping Multilingual Hierarchies Using Relaxation Labeling 
J. Daudé, L. Padró and G. Rigau ..................................................... 12 
Improved Alignment Models for Statistical Machine Translation 
Franz Josef Och, Christoph Tillmann, and Hermann Ney ............................... 20 
Cross-Language Information Retrieval for Technical Documents 
Atsushi Fujii and Tetsuya Ishikawa .................................................. 29 
Boosting Applied to Tagging and PP Attachment 
Steven Abney, Robert E. Schapire and Yoram Singer .................................. 38 
Applying Extrasentential Context to Maximum Entropy Based Tagging With A Large 
Semantic And Syntactic Tagset 
Ezra Black, Andrew Finch and Ruiqiang Zhang ........................................ 46 
Improving POS Tagging Using Machine-Learning Techniques 
Lluís Màrquez, Horacio Rodríguez, Josep Carmona and Josep Montolio ................. 53 
Determining the Specificity of Nouns From Text 
Sharon A. Caraballo and Eugene Charniak ............................................. 63 
Retrieving Collocations From Korean Text 
Seonho Kim, Zooil Yang, Mansuk Song and Jung-Ho Ahn ............................. 71 
Noun Phrase Coreference as Clustering 
Claire Cardie and Kiri Wagstaff ..................................................... 82 
Language Independent Named Entity Recognition Combining Morphological and 
Contextual Evidence 
Silviu Cucerzan and David Yarowsky ................................................ 90 
Unsupervised Models for Named Entity Classification 
Michael Collins and Yoram Singer .................................................. 100 
Hybrid Disambiguation of Prepositional Phrase Attachment and Interpretation 
Sven Hartrumpf .................................................................. 111 
HMM Specialization with Selective Lexicalization 
Jin-Dong Kim, Sang-Zoo Lee and Hae-Chang Rim ................................... 121 
Why Doesn't Natural Language Come Naturally? (INVITED TALK) 
Richard Schwartz ................................................................. 128 
POS Tags and Decision Trees for Language Modeling 
Peter A. Heeman .................................................................. 129 
An Information-Theoretic Empirical Analysis of Dependency-Based Feature Types 
for Word Prediction Models 
Dekai Wu, Zhao Jun and Sui Zhifang ............................................... 138 
Word Informativeness and Automatic Pitch Accent Modeling 
Shimei Pan and Kathleen McKeown ................................................ 148 
Learning Discourse Relations with Active Data Selection 
Tadashi Nomoto and Yuji Matsumoto ............................................... 158 
A Learning Approach to Shallow Parsing 
Marcia Muñoz, Vasin Punyakanok, Dan Roth and Dav Zimak .......................... 168 
Guiding a Well-Founded Parser with Corpus Statistics 
Amon Seagull and Lenhart Schubert ................................................ 179 
Exploiting Diversity in Natural Language Processing: Combining Parsers 
John Henderson and Eric Brill ..................................................... 187 
Lexical Ambiguity and Information Retrieval Revisited 
Julio Gonzalo, Anselmo Peñas and Felisa Verdejo .................................... 195 
Detecting Text Similarity over Short Passages: Exploring Linguistic Feature 
Combinations via Machine Learning 
Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin ....................... 203 
Automated Construction of Weighted String Similarity Measures 
Jörg Tiedemann ................................................................... 213 
Taking the Load Off the Conference Chairs: Towards a Digital Paper Routing Assistant 
David Yarowsky and Radu Florian .................................................. 220 
PP-attachment: A Committee Machine Approach 
Martha A. Alegre, Josep M. Sopena and Agusti Lloberas ............................. 231 
Cascaded Grammatical Relation Assignment 
Sabine Buchholz, Jorn Veenstra and Walter Daelemans ............................... 239 
Automatically Merging Lexicons that have Incompatible Part-of-Speech Categories 
Daniel Ka-Leung Chan and Dekai Wu .............................................. 247 
An Iterative Approach to Estimating Frequencies over a Semantic Hierarchy 
Stephen Clark and David Weir ..................................................... 258 
Using Subcategorization to Resolve Verb Class Ambiguity 
Maria Lapata and Chris Brew ...................................................... 266 
Improving Brill's POS Tagger for an Agglutinative Language 
Beáta Megyesi .................................................................... 275 
Corpus-Based Learning for Noun Phrase Coreference Resolution 
Wee Meng Soon, Hwee Tou Ng and Chung Yong Lim ................................ 285 
Corpus-Based Approach for Nominal Compound Analysis for Korean Based on 
Linguistic and Statistical Information 
Juntae Yoon, Key-Sun Choi and Mansuk Song ....................................... 292 
AUTHOR INDEX 
Steven Abney ......................... 38 
Jung-Ho Ahn ......................... 71 
Martha A. Alegre .................... 231 
Ezra Black ............................ 46 
Chris Brew .......................... 266 
Eric Brill ............................ 187 
Sabine Buchholz ..................... 239 
Sharon A. Caraballo .................. 63 
Claire Cardie ......................... 82 
Josep Carmona ....................... 53 
Daniel Ka-Leung Chan .............. 247 
Eugene Charniak ..................... 63 
Key-Sun Choi ....................... 292 
Kenneth Ward Church ................. 1 
Stephen Clark ....................... 258 
Michael Collins ...................... 100 
Silviu Cucerzan ....................... 90 
Walter Daelemans ................... 239 
J. Daudé .............................. 12 
Eleazar Eskin ........................ 203 
Andrew Finch ........................ 46 
Radu Florian ........................ 220 
Atsushi Fujii .......................... 29 
Julio Gonzalo ........................ 195 
Sven Hartrumpf ..................... 111 
Vasileios Hatzivassiloglou ............ 203 
Peter A. Heeman .................... 129 
John Henderson ..................... 187 
Tetsuya Ishikawa ..................... 29 
Jin-Dong Kim ....................... 121 
Seonho Kim .......................... 71 
Judith L. Klavans ................... 203 
Maria Lapata ........................ 266 
Sang-Zoo Lee ........................ 121 
Chung Yong Lim .................... 285 
Agusti Lloberas ...................... 231 
Lluís Màrquez ........................ 53 
Yuji Matsumoto ..................... 158 
Kathleen McKeown .................. 148 
Beáta Megyesi ....................... 275 
Josep Montolio ....................... 53 
Marcia Muñoz ....................... 168 
Hermann Ney ......................... 20 
Hwee Tou Ng ........................ 285 
Tadashi Nomoto ..................... 158 
Franz Josef Och ...................... 20 
L. Padró .............................. 12 
Shimei Pan .......................... 148 
Anselmo Peñas ...................... 195 
Vasin Punyakanok ................... 168 
G. Rigau ............................. 12 
Hae-Chang Rim ...................... 121 
Horacio Rodríguez .................... 53 
Dan Roth ........................... 168 
Robert E. Schapire ................... 38 
Lenhart Schubert .................... 179 
Richard Schwartz .................... 128 
Amon Seagull ........................ 179 
Michel Simard ......................... 2 
Yoram Singer .................... 38, 100 
Mansuk Song .................... 71, 292 
Wee Meng Soon ..................... 285 
Josep M. Sopena .................... 231 
Sui Zhifang .......................... 138 
Jörg Tiedemann ..................... 213 
Christoph Tillmann ................... 20 
Jorn Veenstra ........................ 239 
Felisa Verdejo ....................... 195 
Kiri Wagstaff ......................... 82 
David Weir .......................... 258 
Dekai Wu ....................... 138, 247 
Zooil Yang ............................ 71 
David Yarowsky ................. 90, 220 
Juntae Yoon ......................... 292 
Ruiqiang Zhang ....................... 46 
Zhao Jun ............................ 138 
Dav Zimak .......................... 168 

