File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/05/p05-2001_intro.xml
Size: 2,302 bytes
Last Modified: 2025-10-06 14:03:09
<?xml version="1.0" standalone="yes"?> <Paper uid="P05-2001"> <Title>Hybrid Methods for POS Guessing of Chinese Unknown Words</Title> <Section position="3" start_page="0" end_page="1" type="intro"> <SectionTitle> 2 Chinese Unknown Words </SectionTitle> <Paragraph position="0"> The definition of what constitutes a word is problematic for Chinese, as Chinese does not have word delimiters and the boundary between compounds and phrases or collocations is fuzzy. Consequently, different NLP tasks adopt different segmentation schemes (Sproat, 2002). With respect to any Chinese corpus or NLP system, therefore, unknown words can be defined as character strings that are not in the lexicon but should be identified as segmentation units based on the segmentation scheme.</Paragraph> <Paragraph position="1"> Chen and Bai (1998) categorized Chinese unknown words into the following five types: 1) acronyms, i.e., shortened forms of long names, e.g., běi-dà for běijīng-dàxué 'Beijing University'; 2) proper names, including person, place, and organization names, e.g., Máo-Zédōng; 3) derived words, which are created through affixation, e.g., xiàndài-huà 'modernize'; 4) compounds, which are created through compounding, e.g., zhǐ-lǎohǔ 'paper tiger'; and 5) numeric type compounds, including numbers, dates, time, etc., e.g., liǎng-diǎn 'two o'clock'. Other types of unknown words exist, such as loan words and reduplicated words. A monosyllabic or disyllabic Chinese word can reduplicate in various patterns, e.g., zǒu-zǒu 'take a walk' and piào-piàoliàng-liàng 'very pretty' are formed by reduplicating zǒu 'walk' and piào-liàng 'pretty' respectively.</Paragraph> <Paragraph position="2"> The identification of acronyms, proper names, and numeric type compounds is a separate task that has received substantial attention. Once a character string is identified as one of these, its POS category also becomes known.
We will therefore focus only on reduplicated words, derived words, and compounds. We will consider only unknown words that are nouns, verbs, or adjectives, as most unknown words fall under these categories (Chen and Bai, 1998). Finally, monosyllabic words will not be considered, as they are well covered by the lexicon.</Paragraph> </Section> </Paper>