Proceedings of HLT/EMNLP 2005 Demonstration Abstracts, pages 30–31,
Vancouver, October 2005.
MBOI: Discovery of Business Opportunities on the Internet
Extended Abstract
Arman Tajarobi, Jean-Franc‚ois Garneau
Nstein Technologies
Qu·ebec, Canada
{arman.tajarobi,jf.garneau}
@nstein.com
Franc‚ois Paradis
Universit·e de Montr·eal
Qu·ebec, Canada
paradifr@iro.umontreal.ca
We propose a tool for the discovery of business
opportunities on the Web, more speci cally to help
a user  nd relevant call for tenders (CFT), i.e. in-
vitations to contractors to submit a tender for their
products/services. Simple keyword-based Informa-
tion Retrieval do not capture the relationships in the
data, which are needed to answer the complex needs
of the users. We therefore augment keywords with
information extracted through natural language pro-
cessing and business intelligence tools. As opposed
to most systems, this information is used at all stages
in the back-end and interface. The bene ts are two-
fold:  rst we obtain higher precision of search and
classi cation, and second the user gains access to a
deeper level of information.
Two challenges are: how to discover new CFT
and related documents on the Web, and how to ex-
tract information from these documents, knowing
that the Web offers no guarantee on the structure and
stability of those documents. A major hurdle to the
discovery of new documents is the poor degree of
 linkedness between businesses, and the open topic
area, which makes topic-focused Web crawling (Ag-
garwal et al., 2001) unapplicable. To extract infor-
mation, wrappers (Soderland, 1999), i.e. tools that
can recognise textual and/or structural patterns, have
limited success because of the diversity and volatil-
ity of Web documents.
Since we cannot assume a structure for docu-
ments, we exploit information usually contained in
CFTs: contracting authority, opening/closing date,
location, legal notices, conditions of submission,
classi cation, etc. These can appear marked up with
tags or as free-text.
A  rst type of information to extract are the so-
called named entities (Maynard et al., 2001), i.e.
names of people, organisations, locations, time or
quantities. To these standard entities we add some
application-speci c entities such as FAR (regulation
number), product dimensions, etc. To extract named
entities we use Nstein NFinderTM, which uses a com-
bination of lexical rules and a dictionary. More de-
tails about the entities, statistics and results can be
found in (Paradis and Nie, 2005a).
We use another tool, Nstein NconceptTM, to ex-
tract concepts, which capture the  themes or  rele-
vant phrases in a document. NConcept uses a com-
bination of statistics and linguistic rules.
As mentioned above, CFTs not only contains in-
formation about the subject of the tender, but also
procedural and regulation information. We tag pas-
sages in the document as  subject or  non-subject ,
according to the presence or absence of the most
discriminant bigrams. Some heuristics are also ap-
plied to use the  good predictors such as URL and
money, or to further re ne the non-subject passages
into  regulation . More details can be found in (Par-
adis and Nie, 2005b).
Another information to extract is the industry or
service, according to a classi cation schema such
as NAICS (North American Industry Classi cation
System) or CPV (Common Procurement Vocabu-
lary). We perform multi-schema, multi-label classi-
 cation, which facilitates use across economic zones
(for instance, an American user may not be familiar
with CPV, a European standard) and confusion over
schemas versions (NAICS version 1997/Canada vs.
NAICS version 2002). Our classi er is a simple
Naive Bayes, trained over 20,000 documents gath-
ered from an American Government tendering site,
FBO (Federal Business Opportunities). Since we
have found classi cation to be sensitive to the pres-
30
ence of procedural contents, we remove non-subject
passages, as tagged above. The resulting perfor-
mance is 61% micro-F1 (Paradis and Nie, 2005b).
Finally, a second level of extraction is performed
to infer information about organisations: their con-
tacts, business relationships, spheres of activities,
average size of contract, etc. This is refered to as
business intelligence (Betts, 2003). For this extrac-
tion we not only use CFTs, but also awards (i.e.
past information about successful bids) and news
(i.e. articles published about an organisation). For
news, we collect co-occurences of entities and clas-
sify them using a semantic network. For example,
the passage  Sun vs. Microsoft is evidence towards
the two companies being competitors.
The extracted information is indexed and queried
using Apache Lucene., with a Web front-end served
by Jakarta Turbine. The interface was designed to
help the user make the most of the extracted infor-
mation, whether in query formulation, document pe-
rusing, or navigation.
Our system supports precise queries by index-
ing free-text and extracted information separately.
For example, the simple keyword query  bush re-
turns all documents where the word occurs, includ-
ing documents about bush trimming and president
Bush, while the query  person:Bush only returns
documents about President Bush. However such
queries are not very user-friendly. We thus provide
an interface for advanced queries and query re ne-
ment.
The extracted information from the 100 top query
results is gathered and presented in small scrollable
lists, one for each entity type. For example, starting
with keyword  bush , the user sees a list of people
in the  person box, and could choose  Bush to re-
 ne her query. The list is also used to expand the
query with a related concept (for example,  removal
services is suggested for  snow ), the expansion of
an acronym, etc.
Queries can be automatically translated using
Cross-Language Information Retrieval techniques
(Peters et al., 2003). To this end we have built a sta-
tistical translation model trained from a collection
of 100,000 French-English pair documents from a
European tendering site, TED (Tenders Electronic
Daily). Two dictionaries were built: one with simple
terms, and one with  concepts , extracted as above.
The intuition is that simple terms will offer better
recall while concepts will give better precision.
The interface shows and allows navigation to the
extracted information. When viewing a CFT, the
user can highlight the entities, as well as the subject
and regulation passages. She can also click on an
organisation to get a company pro le, which shows
the business intelligence attributes as well as related
documents such as past awards or news.
We are currently expanding the business intelli-
gence functionalities, and implementing user  pro-
 les , which will save contextual or background in-
formation and use it transparently to affect querying.
Acknowledgments
This project was  nanced jointly by Nstein Tech-
nologies and NSERC.
References
Charu C. Aggarwal, Fatima Al-Garawi, and Philip S.
Yu. 2001. Intelligent crawling on the world wide
web with arbitrary predicates. In Proceedings Inter-
national WWW Conference.
Mitch Betts. 2003. The future of business intelligence.
Computer World, 14 April.
D. Maynard, V. Tablan, C. Ursu, H. Cunningham, and
Y. Wilks. 2001. Named entity recognition from di-
verse text types. In Recent Advances in Natural Lan-
guage Processing, pages 257 274.
Franc‚ois Paradis and Jian-Yun Nie. 2005a. Discovery
of business opportunities on the internet with informa-
tion extraction. In IJCAI-05 Workshop on Multi-Agent
Information Retrieval and Recommender Systems, 31
July.
Franc‚ois Paradis and Jian-Yun Nie. 2005b. Filtering con-
tents with bigrams and named entities to improve text
classi cation. In Asia Information Retrieval Sympo-
sium, 13 15 October.
C. Peters, M. Braschler, J. Gonzalo, and M. Kluck, edi-
tors. 2003. Advances in Cross-Language Information
Retrieval Systems. Springer.
Stephen Soderland. 1999. Learning information extrac-
tion rules for semi-structured and free text. Machine
Learning, 44(1).
31
