<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1089">
  <Title>Guessing Parts-of-Speech of Unknown Words Using Global Information</Title>
  <Section position="3" start_page="0" end_page="705" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Part-of-speech (POS) tagging is a fundamental language analysis task. In POS tagging, we frequently encounter words that do not exist in training data. Such words are called unknown words.</Paragraph>
    <Paragraph position="1"> They are usually handled by an exceptional process in POS tagging, because the tagging system does not have information about the words.</Paragraph>
    <Paragraph position="2"> Guessing the POS tags of such unknown words is a difficult task. But it is an important issue both for conducting POS tagging accurately and for creating word dictionaries automatically or semiautomatically. There have been many studies on POS guessing of unknown words (Mori and Nagao, 1996; Mikheev, 1997; Chen et al., 1997; Nagata, 1999; Orphanos and Christodoulakis, 1999).</Paragraph>
    <Paragraph position="3"> In most of these previous studies, POS tags of unknown words were predicted using only local information, such as lexical forms and POS tags of surrounding words or word-internal features (e.g. suffixes and character types) of the unknown words. However, this approach is limited by the information available to it. For example, common nouns and proper nouns are sometimes difficult to distinguish using only the information from a single occurrence, because their syntactic functions are almost identical. In English, proper nouns are capitalized and there is generally little ambiguity between common nouns and proper nouns.</Paragraph>
    <Paragraph position="4"> In Chinese and Japanese, no such convention exists and the ambiguity is a serious problem.</Paragraph>
    <Paragraph position="5"> However, if an unknown word with the same lexical form appears in another part of the document with informative local features (e.g. titles of persons), this gives useful clues for guessing the part-of-speech of the ambiguous occurrence, because unknown words with the same lexical form usually have the same part-of-speech. As another example, there is a part-of-speech named sahen-noun (verbal noun) in Japanese. Verbal nouns behave as common nouns, except that they are used as verbs when they are followed by the verb &quot;suru&quot;; e.g., the verbal noun &quot;dokusho&quot; means &quot;reading&quot; and &quot;dokusho-suru&quot; is a verb meaning &quot;to read books&quot;. It is difficult to distinguish a verbal noun from a common noun if it is used as a noun. However, it becomes easy if we know that the word is followed by &quot;suru&quot; in another part of the document. This issue was mentioned by Asahara (2003) as a problem of possibility-based POS tags. A possibility-based POS tag is a POS tag that represents all the possible properties of the word (e.g., a verbal noun is used as a noun or a verb), rather than a property of each instance of the word. For example, a sahen-noun is actually a noun that can be used as a verb when it is followed by &quot;suru&quot;. This property cannot be confirmed without observing real usage of the word appearing with &quot;suru&quot;. Such POS tags may not be identifiable from the local information of a single instance, because each instance exhibits only one of the possible properties. To cope with these issues, we propose a method that uses global information as well as local information for guessing the parts-of-speech of unknown words. With this method, all the occurrences of the unknown words in a document are taken into consideration at once, rather than processing each occurrence separately.
Thus, the method models the whole document and finds a set of parts-of-speech by maximizing its conditional joint probability given the document, rather than independently maximizing the probability of each part-of-speech given each sentence. Global information is known to be useful in other NLP tasks, especially in the named entity recognition task, and several studies successfully used global features (Chieu and Ng, 2002; Finkel et al., 2005).</Paragraph>
    <Paragraph position="6"> One potential advantage of our method is its ability to incorporate unlabeled data. The number of global features can be increased simply by adding unlabeled data to the test data.</Paragraph>
    <Paragraph position="7"> Models in which the whole document is taken into consideration require much more computation than models with only local features. They also cannot process input data one-by-one. Instead, the entire document has to be read before processing. We adopt Gibbs sampling in order to compute the models efficiently. These models are suitable for offline use, such as creating dictionaries from raw text, where real-time processing is not necessary but high accuracy is needed to reduce the human labor required for revising automatically analyzed data.</Paragraph>
    <Paragraph position="8"> The rest of this paper is organized as follows: Section 2 describes a method for POS guessing of unknown words which utilizes global information.</Paragraph>
    <Paragraph position="9"> Section 3 shows experimental results on multiple corpora. Section 4 discusses related work, and Section 5 gives conclusions.</Paragraph>
  </Section>
</Paper>