<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0701"> <Title>Unsupervised Learning of Word Boundary with Description Length Gain</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. Introduction </SectionTitle>
<Paragraph position="0"> Detecting and handling unknown words properly has become a crucial issue in today's practical natural language processing (NLP) technology. No matter how large the dictionary used in an NLP system, running (i.e., real) texts such as scientific articles, newspapers and Web pages contain many new words that the dictionary does not include. Many such words are proper names and special terminology that carry critical information. It is unreliable to rely on delimiters such as white space to detect new lexical units, because many basic lexical items contain one or more spaces, e.g., &quot;New York&quot;, &quot;Hong Kong&quot; and &quot;hot dog&quot;. It appears that unsupervised learning techniques are necessary to alleviate the problem of unknown words in the NLP domain.</Paragraph>
<Paragraph position="1"> There have been a number of studies on lexical acquisition from language data of different types. Wolff attempts to infer word boundaries from artificially-generated natural language sentences, relying heavily on the co-occurrence frequency of adjacent characters [Wolff 1975; Wolff 1977]. Nevill-Manning's text compression program Sequitur can also identify word boundaries and gives a binary tree structure for each identified word [Nevill-Manning 1996]. de Marcken explores unsupervised lexical acquisition from English spoken and written corpora and from a Chinese written corpus [de Marcken 1995; de Marcken 1996].</Paragraph>
<Paragraph position="2"> In this paper, we present an unsupervised approach to lexical acquisition within the minimum description length (MDL) paradigm [Rissanen 1978; Rissanen 1982; Rissanen 1989], with a goodness measure, namely the description length gain (DLG), which is formulated in [Kit 1998] following classic information theory [Shannon 1948; Cover and Thomas 1991]. This measure is used, following the MDL principle, to evaluate the goodness of identifying a (sub)sequence of characters in a corpus as a lexical item. In order to rigorously evaluate the effectiveness of this unsupervised learning approach, we do not limit ourselves to the detection of unknown words with respect to any given dictionary.</Paragraph>
<Paragraph position="3"> Rather, we use it to perform unsupervised lexical acquisition from large-scale English text corpora. Since it is a learning-via-compression approach, the algorithm can be further extended to deal with text compression and, very likely, other data sequencing problems.</Paragraph>
<Paragraph position="4"> The rest of the paper is organised as follows: Section 2 presents the formulation of the DLG measure in terms of classic information theory; Section 3 formulates the learning algorithm within the MDL framework, which aims to achieve an optimal segmentation of the given corpus into lexical items with regard to the DLG measure; Section 4 presents experiments and discusses experimental results with respect to previous studies; and finally, the conclusions of the paper are given in Section 5.</Paragraph> </Section> </Paper>
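Editor's illustrative note: the DLG measure itself is only formulated in Section 2 (following [Kit 1998]), so the sketch below is not the paper's algorithm. It is a minimal Python illustration of the idea described in the introduction: score a candidate character sequence by the drop in the corpus's empirical description length when every occurrence of the candidate is replaced by a single new symbol and the candidate is appended once as its definition. The function names (description_length, dlg), the delimiter symbol, and the greedy left-to-right replacement are all assumptions made for this sketch.

```python
from collections import Counter
from math import log2


def description_length(tokens):
    """Empirical description length of a token sequence:
    DL(X) = -sum over token types x of c(x) * log2(c(x) / |X|),
    i.e. sequence length times its zeroth-order entropy estimate."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c * log2(c / total) for c in counts.values())


def dlg(corpus, candidate, new_symbol="<r>"):
    """Sketch of a description-length-gain style score (assumed form,
    not the paper's exact formulation): gain from treating `candidate`
    as one lexical item, rewriting the corpus with a fresh symbol and
    appending the candidate once as its definition."""
    tokens = list(corpus)
    cand = list(candidate)
    rewritten, i = [], 0
    while i < len(tokens):
        if tokens[i:i + len(cand)] == cand:
            rewritten.append(new_symbol)   # replace one occurrence
            i += len(cand)
        else:
            rewritten.append(tokens[i])
            i += 1
    # Append the candidate's definition, delimited by the new symbol.
    rewritten = rewritten + [new_symbol] + cand
    return description_length(tokens) - description_length(rewritten)


# Toy usage: compare the gain for a repeated substring against a rare one.
text = "the dog saw the dog and the dog ran"
print(dlg(text, "the dog"), dlg(text, "saw t"))
```

A positive score suggests the candidate compresses the corpus and is therefore a plausible lexical item; the learning algorithm outlined for Section 3 would then search for a segmentation that is optimal with respect to such gains.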