TANGO: Bilingual Collocational Concordancer 
Jia-Yan Jian  
Department of Computer 
Science 
National Tsing Hua 
University 
101, Kuangfu Road, 
Hsinchu, Taiwan 
g914339@oz.nthu.edu.tw 
Yu-Chia Chang  
Inst. of Information 
System and Application 
National Tsing Hua 
University 
101, Kuangfu Road, 
Hsinchu, Taiwan 
u881222@alumni.nthu.edu.tw 
Jason S. Chang 
Department of Computer 
Science 
National Tsing Hua 
University 
101, Kuangfu Road, 
Hsinchu, Taiwan 
jschang@cs.nthu.edu.tw 
 
Abstract 
In this paper, we describe TANGO, a 
collocational concordancer for looking up 
collocations. The system was designed to 
answer users' queries about the bilingual 
collocational usage of nouns, verbs, and 
adjectives. We first obtained collocations from 
the large monolingual British National Corpus 
(BNC). Subsequently, we identified collocation 
instances and their translation counterparts in 
a bilingual corpus such as the Sinorama 
Parallel Corpus (SPC) by exploiting a 
word-alignment technique. The main goal of the 
concordancer is to provide the user with a 
reference tool for correct collocation use, so 
as to help second language learners acquire 
the most eminent characteristic of native-like 
writing. 
1 Introduction 
Collocations are word combinations that occur 
together relatively often. They also reflect a 
speaker's fluency in a language and serve as a 
hallmark of near-native language capability. 
Collocation extraction is critical to a range of 
studies and applications, including natural 
language generation, computer assisted language 
learning, machine translation, lexicography, word 
sense disambiguation, cross language information 
retrieval, and so on.  
Hanks and Church (1990) proposed using pointwise 
mutual information to identify collocations in 
lexicography; however, the method may yield 
unacceptable collocations for low-count pairs. The 
best methods for extracting collocations usually 
take both linguistic and statistical constraints 
into consideration. Smadja (1993) also detailed 
techniques for collocation extraction and 
developed a program called XTRACT, which is 
capable of computing flexible collocations based 
on elaborate statistical calculations. Moreover, 
log likelihood ratios are regarded as a more 
effective method for identifying collocations, 
especially when the occurrence count is very low 
(Dunning, 1993). 
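The contrast between the two association measures can be made concrete 
with a minimal Python sketch. It is not the implementation used in TANGO; 
the function names `pmi` and `llr` and all counts are hypothetical, chosen 
only to show that pointwise mutual information rewards a pair seen once, 
whereas the log likelihood ratio of a 2x2 contingency table favors the 
pair with more supporting evidence. 

```python
import math


def pmi(c_xy: int, c_x: int, c_y: int, n: int) -> float:
    """Pointwise mutual information (base 2) of a word pair.

    c_xy: co-occurrence count, c_x / c_y: word counts, n: sample size.
    """
    return math.log2(c_xy * n / (c_x * c_y))


def llr(c_xy: int, c_x: int, c_y: int, n: int) -> float:
    """Log likelihood ratio (G^2) computed from a 2x2 contingency table."""
    # Observed cells: x with y, x without y, y without x, neither.
    observed = [
        [c_xy, c_x - c_xy],
        [c_y - c_xy, n - c_x - c_y + c_xy],
    ]
    rows = [sum(row) for row in observed]
    cols = [observed[0][j] + observed[1][j] for j in range(2)]
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing (0 * log 0 = 0)
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


# Hypothetical counts: a pair seen once versus a pair seen ten times.
N = 10_000
rare = (1, 1, 1, N)
frequent = (10, 100, 100, N)

print("PMI rare / frequent:", pmi(*rare), pmi(*frequent))
print("LLR rare / frequent:", llr(*rare), llr(*frequent))
```

With these assumed counts, PMI ranks the singleton pair above the 
frequent one, while the log likelihood ratio reverses the order. 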
Smadja's XTRACT is the pioneering work on 
extracting collocation types. XTRACT employed 
three different statistical measures of how 
strongly a word pair is associated as a 
collocation type. Setting a separate threshold 
for each statistical measure is complicated. We 
therefore decided to develop a new and simple 
method for extracting monolingual collocations. 
We also provide a web-based user interface 
for searching those collocations and their 
usage. The concordancer helps language learners 
acquire collocation usage. In the following 
section, we give a brief overview of the 
TANGO concordancer. 
2 TANGO 
TANGO is a concordancer capable of answering 
users’ queries on collocation use. Currently, 
TANGO supports two text collections: a 
monolingual corpus (BNC) and a bilingual corpus 
(SPC). The system consists of four main parts: 
2.1 Chunk and Clause Information Integrated 
In the CoNLL-2000 shared task, chunking is 
defined as the process of dividing a sentence into 
syntactically correlated groups of words. Using 
the CoNLL training data, we built a chunker that 
turns sentences into smaller syntactic structures 
of non-recursive basic phrases to facilitate 
precise collocation extraction. Looking at 
adjacent chunks makes it easier to identify 
argument-predicate relationships, and doing so 
saves time compared with n-gram statistics or 
full parsing. Take a text in CoNLL-2000 for 
example: 
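As a rough illustration of how chunk tags expose argument-predicate pairs, 
the following Python sketch groups tokens by their IOB-style chunk tags 
(the word, part-of-speech tag, chunk tag layout of the CoNLL-2000 data) 
and pairs each verb chunk with the noun chunk that follows it. It is not 
the chunker described here: the sample sentence and its tags are 
hypothetical rather than taken from the corpus, and treating the last word 
of a chunk as its head is a simplifying assumption. 

```python
from typing import List, Tuple

Token = Tuple[str, str, str]    # (word, POS tag, chunk tag)
Chunk = Tuple[str, List[str]]   # (chunk type, words in the chunk)


def group_chunks(tokens: List[Token]) -> List[Chunk]:
    """Group words that belong to the same chunk, using B-/I-/O tags."""
    chunks: List[Chunk] = []
    for word, _pos, tag in tokens:
        if tag == "O":
            chunks.append(("O", [word]))
        elif tag.startswith("B-") or not chunks or chunks[-1][0] != tag[2:]:
            # A B- tag, or an I- tag with no matching open chunk, starts one.
            chunks.append((tag[2:], [word]))
        else:
            chunks[-1][1].append(word)
    return chunks


def verb_noun_pairs(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Pair the head of each VP with the head of the NP right after it.

    Assumption: the last word of a chunk serves as its head.
    """
    pairs = []
    for (left_type, left), (right_type, right) in zip(chunks, chunks[1:]):
        if left_type == "VP" and right_type == "NP":
            pairs.append((left[-1], right[-1]))
    return pairs


# Hypothetical sentence tagged in the CoNLL-2000 column layout.
sentence = [
    ("We", "PRP", "B-NP"),
    ("took", "VBD", "B-VP"),
    ("the", "DT", "B-NP"),
    ("necessary", "JJ", "I-NP"),
    ("steps", "NNS", "I-NP"),
    (".", ".", "O"),
]

chunks = group_chunks(sentence)
print(chunks)
print(verb_noun_pairs(chunks))
```

Only adjacent chunks are inspected, so no full parse is needed to reach 
the candidate verb-noun pair. 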
The words correlated with the same chunk tag 
can be further grouped together (see Table 1). For 
instance, with chunk information, we can extract 
