Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 345–352,
Sydney, July 2006. c©2006 Association for Computational Linguistics
A Collaborative Framework for Collecting Thai Unknown Words from
the Web
Choochart Haruechaiyasak, Chatchawal Sangkeettrakarn, Pornpimon Palingoon
Sarawoot Kongyoung and Chaianun Damrongrat
Information Research and Development Division (RDI)
National Electronics and Computer Technology Center (NECTEC)
Thailand Science Park, Klong Luang, Pathumthani 12120, Thailand
rdi5@nnet.nectec.or.th
Abstract
We propose a collaborative framework for
collecting Thai unknown words found on
Web pages over the Internet. Our main
goal is to design and construct a Web-
based system which allows a group of in-
terested users to participate in construct-
ing a Thai unknown-word open dictionary.
The proposed framework provides sup-
porting algorithms and tools for automati-
cally identifying and extracting unknown
words from Web pages of given URLs.
The system yields the result of unknown-
word candidates which are presented to
the users for verification. The approved
unknown words could be combined with
the set of existing words in the lexicon
to improve the performance of many NLP
tasks such as word segmentation, infor-
mation retrieval and machine translation.
Our framework includes word segmenta-
tion and morphological analysis modules
for handling the non-segmenting charac-
teristic of Thai written language. To take
advantage of large available text resource
on the Web, our unknown-word boundary
identification approach is based on the sta-
tistical string pattern-matching algorithm.
Keywords: Unknown words, open dictio-
nary, word segmentation, morphological
analysis, word-boundary detection.
1 Introduction
The advent of the Internet and the increasing pop-
ularity of the Web have altered many aspects of
natural language usage. Asmorepeople turntothe
Internet as a new communicating channel, the tex-
tual information has increased tremendously and
is also widely accessible. More importantly, the
available information is varied largely in terms of
topic difference and multi-language characteristic.
It is not uncommon to find a Web page written in
Thailies adjacent to aWebpage written in English
via a hyperlink, or a Web page containing both
Thai and English languages. In order to perform
well in this versatile environment, an NLP system
must be adaptive enough to handle the variation in
language usage. One of the problems which re-
quires special attention is unknown words.
As with most other languages, unknown words
also play an extremely important role in Thai-
language NLP. Unknown words are viewed as one
of the problematic sources of degrading the per-
formance of traditional NLP applications such as
MT (Machine Translation), IR (Information Re-
trieval) and TTS (Text-To-Speech). Reduction in
the amount of unknown words or being able to
correctly identify unknown words in these sys-
tems would help increase the overall system per-
formance.
The problem of unknown words in Thai lan-
guage is perhaps more severe than in English or
other latin-based languages. As a result of the
information technology revolution, Thai people
have become more familiar with other foreign lan-
guages especially English. It is not uncommon to
hear a few English words over a course of con-
versation between two Thai people. The foreign
words along with other Thai named entities are
among the new words which are continuously cre-
ated and widely circulated. To write a foreign
word, the transliterated form of Thai alphabets is
often used. The Royal Institute of Thailand is the
official organization in Thailand who has respon-
345
sibility and authority indefining andapproving the
use of new words. The process of defining a new
word is manual and time-consuming as each word
must be approved by a working group of linguists.
Therefore, this traditional approach of construct-
ingthe lexicon isnotasuitable solution, especially
for systems running on the Web environment.
Due to the inefficiency of using linguists in
defining new lexicon, there must be a way to au-
tomatically or at least semi-automatically collect
new unknown words. In this paper, we propose
a collaborative framework for collecting unknown
words from Web pages over the Internet. Our
main purpose is to design and construct a system
which automatically identifies and extracts un-
known words found on Web pages of given URLs.
The compiled list of unknown-word candidates is
to be verified by a group of participants. The ap-
proved unknown words are then added to the ex-
isting lexicon along with the other related infor-
mation such as meaning and POS (part of speech).
Thispaper focuses ontheunderlying algorithms
for supporting the process of identifying and ex-
tracting unknown words. The overall process is
composed of two steps: unknown-word detection
and unknown-word boundary identification. The
first step is to detect the locations of unknown-
word occurrences from a given text. Since Thai
language belongs to the class of non-segmenting
language group in which words are written contin-
uously without using any explicit delimiting char-
acter, detection of unknown words could be ac-
complished mainly by using a word-segmentation
algorithm with a morphological analysis. By us-
ing a dictionary-based word-segmentation algo-
rithm, locations of words which are not previ-
ously included in the dictionary will be easily de-
tected. These unknown words belong to the class
of explicit unknown words and often represent the
transliteration of foreign words.
The other class of unknown words is hidden
unknown words. This class includes new words
which are created through the combination of
some existing words in the lexicon. The hidden
unknown words are usually named entities such
as a person’s name and an organization’s name.
Thehidden unknown words could be identified us-
ing the approaches such as n-gram generation and
phrase chunking. The scope of this paper focuses
only on the extraction of the explicit unknown
words. However, the design of our framework also
includes the extraction of hidden unknown words.
We will continue to explore this issue in our future
works.
Once the location of an unknown word is de-
tected, the second step involves the identification
of its boundary. Since we use the Web as our
mainresource, wecould takeadvantage ofitslarge
availability of textual contents. We are interested
in collecting unknown words which occur more
than once throughout the corpus. Unknown words
which occur only once in the large corpus are not
considered as being significant. These words may
be unusual words which are not widely accepted,
orcould be misspelling words. Using this assump-
tion, our approach for identifying the unknown-
word boundary is based on a statistical pattern-
matching algorithm. The basic idea is that the
same unknown word which occurs more than once
would likely to appear in different surrounding
contexts. Therefore, a group of characters which
form the unknown word could be extracted by an-
alyzing the string matching patterns.
To evaluate the effectiveness of our proposed
framework, experiments using a real data set col-
lected from the Web are performed. The experi-
ments are designed to test each of the two main
steps of the framework. Variation of morphologi-
cal analysis are tested for the unknown-word de-
tection. The detection rate of unknown words
were found to be as high as approximately 96%.
Three variations of string pattern-matching tech-
niques were tested for unknown-word boundary
identification. The identification accuracy was
found to be as high as approximately 36%. The
relatively low accuracy is not the major concern
since the unknown-word candidates are to be ver-
ified and corrected by users before they are ac-
tually added to the dictionary. The system is
implemented via the Web-browser environment
which provides user-friendly interface for verifi-
cation process.
The rest of this paper is organized as fol-
lows. The next section presents and discusses
related works previously done in the unknown-
word problem. Section 3 provides an overview
of unknown-word problem in the relation to the
word-segmentation process. Section4presents the
proposed framework with underlying algorithms
in details. Experiments are performed in Section
5 with results and discussion. The conclusion is
given in Section 6.
346
2 Previous Works
The research and study in unknown-word prob-
lem have been extensively done over the past
decades. Unknown words are viewed as prob-
lematic source in the NLP systems. Techniques
in identifying and extracting unknown words are
somewhat language-dependent. However, these
techniques could be classified into two major cat-
egories, one for segmenting languages and an-
other for non-segmenting languages. Segment-
ing languages, such as latin-based languages, use
delimiting characters to separate written words.
Therefore, once the unknown words are detected,
their boundaries could be identified relatively eas-
ily when compared to those for non-segmenting
languages.
Some examples of techniques involving
segmenting languages are listed as follows.
Toole (2000) used multiple decision trees to
identify names and misspellings in English texts.
Features used in constructing the decision trees
are, for example, POS (Part-Of-Speech), word
length, edit distance and character sequence
frequency. Similarly, a decision-tree approach
was used to solve the POS disambiguation
and unknown word guessing in (Orphanos and
Christodoulakis, 1999). The research in the
unknown-word problem for segmenting lan-
guages is also closely related to the extraction of
named entities. The difference of these techniques
to those in non-segmenting languages is that
the approach needs to parse the written text in
word-level as opposed to character-level.
The research in unknown-word problem for
non-segmenting languages is highly active for
Chinese and Japanese. Many approaches have
been proposed and experimented with. Asahara
and Matsumoto (2004) proposed a technique of
SVM-based chunking to identify unknown words
from Japanese texts. Their approach used a sta-
tistical morphological analyzer to segment texts
into segments. The SVM was trained by using
POS tags to identify the unknown-word bound-
ary. Chen and Ma (2002) proposed a practical
unknown word extraction system by considering
both morphological and statistical rule sets for
word segmentation. Chang and Su (1997) pro-
posed an unsupervised iterative method for ex-
tracting unknown lexicons from Chinese text cor-
pus. Theiridea is toinclude the potential unknown
words to the augmented dictionary in order to im-
prove the word segmentation process. Their pro-
posed approach also includes both contextual con-
straints and the joint character association metric
to filter the unlikely unknown words. Other ap-
proaches to identify unknown words include sta-
tistical or corpus-based (Chen and Bai, 1998), and
the use of heuristic knowledge (Nie et al. , 1995)
and contextual information (Khoo and Loh, 2002).
Some extensions to unknown-word identification
have been done. An example include the determi-
nation of POS for unknown words (Nakagawa et
al. , 2001).
The research in unknown words for Thai lan-
guage has not been widely done as in other lan-
guages. Kawtrakul et al. (1997) used the combina-
tion of a statistical model and a set of context sen-
sitive rules to detect unknown words. Our frame-
workhasadifferent goal from previous works. We
consider unknown-word problem as collaborative
task among a group of interested users. As more
textual content is provided to the system, new un-
known words could be extracted with more accu-
racy. Thus, our framework can be viewed as col-
laborative and statistical or corpus-based.
3 Unknown-Word Problem in Word
Segmentation Algorithms
Similar to Chinese, Japanese and Korea, Thai lan-
guage belongs to the class of non-segmenting lan-
guages in which words are written continuously
without using any explicit delimiting character.
To handle non-segmenting languages, the first re-
quired step istoperform word segmentation. Most
word segmentation algorithms use a lexicon or
dictionary to parse texts at the character-level. A
typical word segmentation algorithm yields three
types of results: known words, ambiguous seg-
ments, and unknown segments. Known words are
existing words in the lexicon. Ambiguous seg-
ments are caused by the overlapping of twoknown
words. Unknown segments are the combination of
characters which are not defined in the lexicon.
In this paper, we are interested in extracting
the unknown words with high precision and re-
call results. Three types of unknown words are
hidden, explicit and mixed (Kawtrakul et al. ,
1997). Hidden unknown words are composed by
different words existing in the lexicon. To illus-
trate the idea, let us consider an unknown word
ABCD where A, B, C, and D represents individ-
ual characters. Suppose that AB and CD both ex-
347
ist in a dictionary, then ABCD is considered as
a hidden unknown word. The explicit unknown
words are newly created words by using differ-
ent characters. Let us again consider an unknown
word ABCD. Suppose that there is no substring
of ABCD (i.e., AB, BC, CD, ABC, BCD) exists in
thedictionary, then ABCDisconsidered asexplicit
unknown words. The mixed unknown words are
composed of both existing words in a dictionary
and non-existing substrings. From the example of
unknown string ABCD, if there is at least one sub-
string of ABCD (i.e., AB, BC, CD, ABC, BCD) ex-
ists in the dictionary, then ABCD is considered as
a mixed unknown word.
It can be immediately seen that the detection of
the hidden unknown words are not trivial since the
parser would mistakenly assume that all the frag-
ments of the words are valid, i.e., previously de-
fined in the dictionary. In this paper, we limit our-
self to the extraction of the explicit and mixed un-
known words. This type of unknown words usu-
ally represent the transliteration of foreign words.
Detection of these unknown words could be ac-
complished mainly by using a word-segmentation
algorithm withamorphological analysis. Byusing
a dictionary-based word-segmentation algorithm,
locations of words which are not previously de-
fined in the lexicon could be easily detected.
4 The Proposed Framework
The overall framework is shown in Figure 1.
Two major components are information agent and
unknown-word analyzer. The details of each com-
ponent are given as follows.
• Information agent: This module is com-
posed of a Web crawler and an HTMLparser.
Itisresponsible for collecting HTMLsources
from the given URLs and extracting the tex-
tual data from the pages. Our framework is
designed to support multi-user and collabora-
tive environment. The advantage of this de-
sign approach is that unknown words could
be collected and verified more efficiently.
More importantly, it allows users to select the
Web pages which suit their interests.
• Unknown-word analyzer: This module is
composed ofmany components foranalyzing
and extracting unknown words. Word seg-
mentation module receives text strings from
the information agent and segments them
into a list of words. N-gram generation
module is responsible for generating hidden
unknown-word candidates. Morphological
analysis module is used to form initial ex-
plicit unknown-word segments. String pat-
tern matching unit performs unknown-word
boundary identification task. It takes the
intermediate unknown segments and iden-
tifies their boundaries by analyzing string
matching patterns The results are processed
unknown-word candidates which are pre-
sented to linguists for final post-processing
and verification. New unknown words are
combined with the dictionary to iteratively
improve the performance of the word seg-
mentation module. Details of each compo-
nent are given in the following subsections.
4.1 Unknown-Word Detection
As previously mentioned in Section 3, applying
a word-segmentation algorithm on a text string
yields three different segmented outputs: known,
ambiguous, and unknown segments. Since our
goal is to simply detect the unknown segments
without solving or analyzing other related issues
in word segmentation, using the longest-matching
word segmentation algorithm previously proposed
by Poowarawan (1986) is sufficient. An exam-
ple to illustrate the word-segmentation process is
given as follows.
Let the following string denotes a
text string written in Thai language:
{a1a2...aib1b2...bjc1c2...ck}. Suppose that
{a1a2...ai} and {c1c2...ck} are known words
from the dictionary, and {b1b2...bj} be an un-
known word. For the explicit unknown-word
case, applying the word-segmentation algo-
rithm would yield the following segments:
{a1a2...ai}{b1}{b2}...{bj}{c1c2...ck}. It can be
observed that the detected unknown positions for
a single unknown word are individual characters
in the unknown word itself. Based on the initial
statistical analysis of a Thai lexicon, it was found
that the averaged number of characters in a word
is equal to 7. This characteristic is quite different
from other non-segmenting languages such as
Chinese and Japanese in which a word could
be a character or a combination of only a few
characters. Therefore, to reduce the complexity
in unknown-word boundary identification task,
the unknown segments could be merged to
form multiple-character segments. For exam-
348
a0 a1 a2 a3
a4 a5 a3 a6 a7 a8 a9 a10 a5 a5 a2 a3 a7 a11 a12 a10 a5 a13 a14 a6 a7 a8
a15 a15 a15
a16 a17 a12 a18
a9 a10 a3 a1 a2 a3
a19 a2 a20
a21 a3 a10 a22 a23 a2 a3
a24 a25 a26 a27 a28 a29 a30 a31 a32 a27 a25
a33 a34 a35 a25 a31
a36 a25 a37 a25 a27 a38 a25 a39 a40 a27 a28 a41 a33 a25 a30 a42 a43 a44 a35 a28
a45 a11 a46 a3 a10 a47 a46 a2 a7 a2 a3 a10 a5 a6 a48 a7
a49 a6 a13 a5 a6 a48 a7 a10 a3 a50
a0 a7 a51 a7 a48 a22 a7
a19 a48 a3 a52 a1
a53 a54 a55 a56 a57 a58 a57a59
a60 a61 a62 a61 a63 a64 a61
a65 a66 a67 a68 a66 a61 a59 a69
a70 a57 a71 a71 a66 a61
a60 a61 a62 a61 a63 a64 a61
a65 a66 a67 a68 a66 a61 a59 a69
a60 a61 a62 a61 a63 a64 a61
a72 a63 a73 a71
a74 a75 a61 a71 a57 a71 a75 a59 a66 a69
a12 a48 a3 a76 a14 a48 a23 a48 a8 a6a13 a10 a23
a77 a7 a10 a23 a50 a1 a6 a1
a19 a48 a3 a52 a4 a2 a8 a47 a2 a7 a5 a10 a5 a6 a48 a7
a78 a79 a80 a79 a81 a82 a79
a83 a84 a85 a86 a84 a79 a87 a88 a82 a89a87 a90
a91 a81 a79 a87 a84 a92 a87 a88
Figure 1: The proposed framework for collecting Thai unknown words.
ple, a merging of two characters per segment
would give the following unknown segments:
{b1b2}{b3b4}...{bj−1bj}. In the following experi-
ment section, the merging of two to fivecharacters
per segment including the merging of all unknown
segments without limitation will be compared.
Morphological analysis is applied to guaran-
tee grammatically correct word boundaries. Sim-
ple morphological rules are used in the frame-
work. The rule set is based on two types of
characters, front-dependent characters and rear-
dependent characters. Front-dependent characters
are characters which must be merged to the seg-
ment leading them. Rear-dependent characters
are characters which must be merged to the seg-
ment following them. In Thai written language,
these dependent characters are some vowels and
tonal characters which have specific grammatical
constraints. Applying morphological analysis will
help making the unknown segments more reliable.
4.2 Unknown-Word Boundary Identification
Once the unknown segments are detected, they
are stored into a hashtable along with their con-
textual information. Our unknown-word bound-
ary identification approach is based on a string
pattern-matching algorithm previously proposed
by Boyer and Moore (1977). Consider the
unknown-word boundary identification as a string
pattern-matching problem, there are two possible
strategies: considering the longest matching pat-
tern and considering the most frequent matching
pattern as the unknown-word candidates. Both
strategies could beexplained moreformally asfol-
lows.
Given a set of N text strings, {S1S2...SN},
where Si, is a series of leni characters de-
noted by {ci,1ci,2...ci,leni} and each is marked
with an unknown-segment position, posi, where
1≤posi≤leni. Given a new string, Sj, with
an unknown-segment position, posj, the longest
pattern-matching strategy iterates through each
string, S1 to SN and records the longest string pat-
tern which occur in both Sj and the other string
in the set. On the other hand, the most fre-
quent pattern-matching strategy iterates through
each string, S1 to SN, but records the matching
pattern which occur most frequently.
The results from the unknown-word bound-
ary identification are unknown-word candidates.
These candidates are presented to the users for
verification. Our framework is implemented via
a Web-browser interface which provides a user-
friendly environment. Figure 2 shows a screen
snapshot of our system. Each unknown word is
listed within a text field box which allows a user to
edit and correct its boundary. The contexts could
be used as some editing guidelines and are also
stored into the database.
349
Figure 2: Example of Web-Based Interface
5 Experiments and Results
In this section, we evaluate the performance of
our proposed framework. The corpus used in the
experiments is composed of 8,137 newspaper ar-
ticles collected from a top-selling Thai newspa-
per’s Web site (Thairath, 2003) during 2003. The
corpus contains a total of 78,529 unknown words
of which 14,943 are unique. This corpus was
focused on unknown words which are transliter-
ated from foreign languages, e.g., English, Span-
ish, Japanese and Chinese. We use the publicly
available Thai dictionary LEXiTRON, which con-
tains approximately 30,000 words, in our frame-
work (Lexitron, 2006).
We first analyze the unknown-word set to ob-
serve its characteristics. Figure 3 shows the plot
of unknown-word frequency distribution. Not sur-
prisingly, the frequency of unknown-word usage
follows a Zipf-like distribution. This means there
areagroup ofunknown words whichareused very
often, while some unknown words are used only a
few times over a time period. Based on the fre-
quency statistics of unknown words, only about
3%(2,375 words outof 78,529) occur onlyonce in
thecorpus. Therefore, thisfindingsupports theuse
of statistical pattern-matching algorithm described
in previous section.
5.1 Evaluation of Unknown-Word Detection
Approaches
As discussed in Section 4, multiple unknown seg-
ments could be merged to form a representative
unknown segment. The merging will help reduce
the complexity in the unknown-word boundary
identification as fewer segments will be checked
for the same set of unknown words.
The following variations of merging approach
are compared.
• No merging (none): No merging process is
0 500 1000 1500 20000
100
200
300
400
500
600
Rank
Frequency
Figure 3: Unknown-word frequency distribution.
applied.
• N-character Merging (N-char): Allow the
maximum of N characters per segment.
• Merging all segments (all): No limit on num-
ber of characters per segment.
We measure the performance of unknown-word
detection task by using two metrics. The first is
the detection rate (or recall) which is equal to the
number of detected unknown words divided bythe
total number of previously tagged unknown words
in the corpus. The second is the averaged de-
tected positions per word. The second metric di-
rectly represents the overhead or the complexity
to the unknown-word boundary identification pro-
cess. This is because all detected positions from
a single unknown word must be checked by the
process. The comparison results are shown in Fig-
ure 4. As expected, the approach none gives the
maximum detection rate of 96.6%, while the ap-
proach all yields the lowest detection rate. An-
other interesting observation is that the approach
2-char yields comparable detection rate to the ap-
350
a93 a94 a95 a94 a96 a95 a97 a98 a99 a100 a101 a95 a94 a102 a103 a104
a105 a106 a94 a107 a101 a108 a94 a109 a93 a94 a95 a94 a96 a95 a94 a109 a110 a98 a111 a97 a95 a97 a98 a99 a111 a110 a94 a107 a112 a98 a107 a109
a113 a114 a115 a114 a113 a116 a115 a117 a113 a117 a115 a113 a113 a118 a115 a113 a113 a119 a115 a114 a120 a116 a115 a121
a116 a115 a113 a119 a115 a113 a117 a119 a115 a114 a117 a119 a115 a121 a114 a119 a115 a117 a118 a122 a115 a113
a123 a99 a124 a99 a98 a125 a99 a126 a127 a94 a108 a128 a94 a99 a95 a129 a94 a107 a108 a97 a99 a108 a105 a130 a130 a107 a98 a101 a96 a131
a99 a98 a99 a94 a132 a126 a96 a131 a101 a107 a133 a126 a96 a131 a101 a107 a134 a126 a96 a131 a101 a107 a135 a126 a96 a131 a101 a107 a101 a136 a136
Figure 4: Unknown-word detection results
proach none, however, its averaged detected posi-
tions per word is about three times lower. There-
foretoreduce thecomplexity during the unknown-
word boundary identification process, one might
want to consider using the merging approach of
2-char.
none 2−char 3−char 4−char 5−char all5
10
15
20
25
30
35
40
Unknown−Segment Merging Approach
Word−Boundary Identification Accuracy(%)
long
freq
freq−morph
Figure 5: Comparison between different
unknown-word boundary detection approaches.
5.2 Evaluation of Unknown-Word Boundary
Identification
The unknown-word boundary identification is
based on string pattern-matching algorithm. The
following variations of string pattern-matching
technique are compared.
• Longest matching pattern (long): Select the
longest-matching unknown-word candidate
• Most-frequent matching pattern (freq): Se-
lect the most-frequent-matching unknown-
word candidate
• Most-frequent matching pattern with mor-
phological analysis (freq-morph): Similar
the the approach freq but with additional
morphological analysis to guarantee that the
word boundaries are grammatically correct.
The comparison among all variations of string
pattern-matching approaches areperformed across
all unknown-segment merging approach. The re-
sults are shown in Figure 5. The performance met-
ric is the word-boundary identification accuracy
which is equal to the number of unknown words
correctly extracted divided by the total number
of tested unknown segments. It can be observed
that the selection of different merging approaches
doesnotreallyeffecttheaccuracy oftheunknown-
word boundary identification process. But since
the approach none generates approximately 6 po-
sitions per unknown segment on average, it would
be more efficient to perform a merging approach
which could reduce the number of positions down
by at least 3 times.
The plot also shows the comparison among
three approaches of string pattern-matching. Fig-
ure 6 summarizes the accuracy results of each
string pattern-matching approach by taking the av-
erage on alldifferent merging approaches. Theap-
proach long performed poorly with the averaged
accuracy of 8.68%. This is not surprising because
selection of the longest matching pattern does not
mean that its boundary will be identified correctly.
The approaches freq and freq-morph yield simi-
lar accuracy of about 36%. The freq-morph im-
proves the performance of the approach freq by
less than 1%. The little improvement is due to
the fact that the matching strings are mostly gram-
matically correct. However, the error is caused by
the matching collocations of the unknown-word
context. If an unknown word occurs together ad-
jacent to another word very frequently, they will
likely be extracted by the algorithm. Our solu-
tion to this problem is by providing the users with
a user-friendly interface so unknown-word candi-
dates could be easily filtered and corrected.
6 Conclusion
We proposed a framework for collecting Thai un-
known words from the Web. Our framework
351
a137 a138 a139 a140 a141 a142 a143 a144 a141 a142 a143 a144 a145 a146 a138 a142 a147 a148
a149 a150 a143 a142 a151 a140 a143 a152 a149 a153 a153 a154 a142 a151 a153 a155 a156 a157 a158 a159 a160 a161 a159 a162 a161 a160 a163 a159 a162 a161 a160 a164 a165
a166 a139 a167 a139 a138 a168 a139 a145 a169 a138 a142 a152 a170 a138 a154 a139 a152 a151 a142 a155 a171 a152 a143 a139 a172 a173 a141 a173 a153 a151 a172 a173 a138 a139 a149 a147 a147 a142 a138 a151 a153 a148
Figure 6: Unknown-word boundary identification results
is composed of an information agent and an
unknown-word analyzer. The task of the infor-
mation agent is to collect and extract textual data
from Web pages of given URLs. The unknown-
word analyzer involves two processes: unknown-
word detection and unknown-word boundary
identification. Due to the non-segmenting char-
acteristic of Thai written language, the unknown-
word detection is based on a word-segmentation
algorithm with a morphological analysis. To take
advantage of large available text resource from the
Web, the unknown-word boundary identification
is based on the statistical pattern-matching algo-
rithm.
We evaluate our proposed framework on a col-
lection of Web Pages obtained from a Thai news-
paper’s Web site. The evaluation is divided to test
each of the two processes underlying the frame-
work. For the unknown-word detection, the detec-
tion rate isfound tobe ashigh as 96%. Inaddition,
by merging a few characters into a segment, the
number of required unknown-word extraction is
reduced byatleast 3times, whilethedetection rate
is relatively maintained. For the unknown-word
boundary identification, considering the highest
frequent occurrence of string pattern is found to
be the most effective approach. The identification
accuracy was found to beas high as approximately
36%. The relatively low accuracy is not the major
concern since the unknown-word candidates are to
be verified and corrected by users before they are
actually added to the dictionary.
References
Masayuki Asahara and Yuji Matsumoto. 2004.
Japanese unknown word identification by character-
based chunking. Proceedings of the 20th Inter-
national Conference on Computational Linguistics
(COLING-2004), 459–465.
R. Boyer and S. Moore. 1977. A fast string searching
algorithm. Communications of the ACM, 20:762–
772.
Jing-Shin Chang and Keh-Yih Su. 1997. An Unsu-
pervised Iterative Method for Chinese New Lexicon
Extraction. International Journal of Computational
Linguistics & Chinese Language Processing, 2(2).
Keh-Jianne Chen and Ming-Hong Bai. 1998. Un-
known Word Detection for Chinese by a Corpus-
based Learning Method. Computational Linguistics
and Chinese Language Processing, 3(1):27–44.
Keh-Jianne Chen and Wei-Yun Ma. 2002. Unknown
Word Extraction for Chinese Documents. Proceed-
ings of the 19th International Conference on Com-
putational Linguistics (COLING-2002), 169–175.
Asanee Kawtrakul,Chalatip Thumkanon,Yuen Poovo-
rawan, Patcharee Varasrai, and Mukda Suktarachan.
1997. Automatic Thai Unknown Word Recogni-
tion. Proceedingsof the Natural Language Process-
ing Pacific Rim Symposium, 341–348.
Christopher S.G. Khoo and Teck Ee Loh. 2002. Us-
ing statistical and contextual information to iden-
tify two-and three-character words in Chinese text.
Journalof theAmericanSocietyfor InformationSci-
ence and Technology, 53(5):365–377.
Lexitron Version 2.1, Thai-English Dictionary. Source
available: http://lexitron.nectec.or.th, February
2006.
Tetsuji Nakagawa, Taku Kudoh and Yuji Matsumoto.
2001. Unknown Word Guessing and Part-of-Speech
Tagging Using Support Vector Machines. Proceed-
ings of the Sixth Natural Language Processing Pa-
cific Rim Symposium (NLPRS 2001), 325–331.
Jian-Yun Nie, Marie-Louise Hannan and Wanying Jin.
1995. Unknown Word Detection and Segmentation
of Chinese Using Statistical and Heuristic Knowl-
edge. Communications of COLIPS, 5(1&2):47–57.
Giorgos S. Orphanos and Dimitris N. Christodoulakis.
1999. POS Disambiguation and Unknown Word
Guessing with Decision Trees. Proceedings of the
EACL, 134–141.
Yuen Poowarawan. 1986. Dictionary-based Thai Syl-
lableSeparation. Proceedingsof the Ninth Electron-
ics Engineering Conference.
Thairath Newspaper. Source available:
http://www.thairath.com.
Janine Toole. 2000. Categorizing Unknown Words:
Using Decision Trees to Identify Names and Mis-
spellings. Proceeding of the 6th Applied Natu-
ral Language Processing Conference (ANLP 2000),
173–179.
352
