PART-OF-SPEECH INDUCTION FROM SCRATCH

INTRODUCTION

Part-of-speech information about individual words is necessary for any kind of syntactic and higher-level processing of natural language. While it is easy to obtain lists with part-of-speech labels for frequent English words, such information is not available for less common languages. Even for English, a categorization of words tailored to a particular genre may be desired. Finally, there are rare words that need to be categorized even if frequent words are covered by an available electronic dictionary.

This paper presents a method for inducing the parts of speech of a language, and part-of-speech labels for individual words, from a large text corpus. Little, if any, language-specific knowledge is used, so the method is in principle applicable to any language. Since the part-of-speech representations are derived from the corpus, the resulting categorization is highly text-specific and does not contain categories that are inappropriate for the genre in question. The method is efficient enough for vocabularies of tens of thousands of words, thus addressing the problem of coverage.

The problem of how syntactic categories can be induced is also of theoretical interest in language acquisition and learnability. Syntactic category information is part of the basic knowledge about language that children must learn before they can acquire more complicated structures. It has been claimed that "the properties that the child can detect in the input - such as the serial positions and adjacency and co-occurrence relations among words - are in general linguistically irrelevant" (Pinker 1984). It will be shown here that the relative position of words with respect to each other is sufficient for learning the major syntactic categories. In the first part of the derivation, two iterations of a massive linear approximation of cooccurrence counts categorize unambiguous words. Then a neural net trained on these words classifies individual contexts of occurrence of ambiguous words. (Both steps are sketched in code at the end of this section.)

An evaluation suggests that the method classifies both ambiguous and unambiguous words correctly. It differs from previous work in its efficiency and applicability to large vocabularies, and in that linguistic knowledge is used only in the very last step, so that theoretical assumptions that do not hold for a language or sublanguage have minimal influence on the classification.

The next two sections describe the linear approximation and a birecurrent neural network for the classification of ambiguous words. The last section discusses the results.
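To make the two-step derivation concrete, the following is a minimal sketch of one plausible reading of the first step: left- and right-neighbour counts are collected for each word, the count matrix is smoothed by a low-rank SVD (one standard form of linear approximation), a second iteration re-describes each word's neighbours in the smoothed space, and the resulting vectors are clustered into induced categories. The corpus file name, vocabulary cutoff, rank, and cluster count are illustrative assumptions, not values taken from the paper.

```python
# Sketch of step one: categorize words by a low-rank ("linear")
# approximation of their left/right neighbour cooccurrence counts.
# All numeric settings below are illustrative assumptions.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def neighbour_matrix(tokens, index, V):
    """Count left and right neighbours for each vocabulary word."""
    C = np.zeros((V, 2 * V))                  # columns: [left ctx | right ctx]
    for i in range(1, len(tokens) - 1):
        w = index.get(tokens[i])
        if w is None:
            continue
        l, r = index.get(tokens[i - 1]), index.get(tokens[i + 1])
        if l is not None:
            C[w, l] += 1                      # left-neighbour count
        if r is not None:
            C[w, V + r] += 1                  # right-neighbour count
    return C

def low_rank(C, r):
    """Rank-r SVD approximation; rows become r-dimensional word vectors."""
    U, S, _ = np.linalg.svd(C, full_matrices=False)
    return U[:, :r] * S[:r]

tokens = open("corpus.txt").read().split()    # hypothetical corpus file
vocab = [w for w, _ in Counter(tokens).most_common(2000)]
index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Iteration 1: raw neighbour counts -> low-rank word vectors.
C = neighbour_matrix(tokens, index, V)
X = low_rank(C, r=15)

# Iteration 2 (one reading of "two iterations"): re-describe each
# neighbour by its iteration-1 vector, then reduce again.
C2 = np.hstack([C[:, :V] @ X, C[:, V:] @ X])
X2 = low_rank(C2, r=15)

# Cluster the resulting vectors into induced part-of-speech classes.
labels = KMeans(n_clusters=20, n_init=10).fit_predict(X2)
```

The second step is realized in the paper by a birecurrent neural network described in a later section; as a deliberately simplified stand-in, any classifier trained on the contexts of already categorized words can label each individual occurrence of an ambiguous word:

```python
# Simplified stand-in for the paper's birecurrent network: a plain
# classifier over context vectors (an assumption for illustration only).
from sklearn.linear_model import LogisticRegression

def context_vector(i):
    """Induced vectors of the left and right neighbours of position i."""
    z = np.zeros(X2.shape[1])
    l, r = index.get(tokens[i - 1]), index.get(tokens[i + 1])
    return np.concatenate([X2[l] if l is not None else z,
                           X2[r] if r is not None else z])

# Train on occurrences of in-vocabulary words, labelled with their
# step-one cluster; capped here to keep the sketch cheap.
positions = [i for i in range(1, len(tokens) - 1) if tokens[i] in index][:20000]
Xc = np.stack([context_vector(i) for i in positions])
yc = np.array([labels[index[tokens[i]]] for i in positions])
clf = LogisticRegression(max_iter=1000).fit(Xc, yc)

# An occurrence of an ambiguous word is then tagged from its context alone.
tag = clf.predict(context_vector(positions[0])[None, :])[0]
```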