<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0124"> <Title>Analysis of Unknown Lexical Items using Morphological and Syntactic Information with the TIMIT Corpus</Title>
<Section position="4" start_page="0" end_page="262" type="metho"> <SectionTitle> 2 Problem Description </SectionTitle>
<Paragraph position="0"> A major problem that can occur when parsing sentences is the appearance of unknown words -- words that are not contained in the lexicon of the system. As mentioned in Weischedel, et al. [15], the best-performing system at the Second Message Understanding Conference (MUC-2) simply halted parsing when an unknown word was encountered. Clearly, for a parser to be considered robust it must have mechanisms to process unknown words. The need to cope with unknown words will continue to grow as new words are coined and words associated with sub-cultures leak into the mainstream vocabulary. In the case of a large corpus, especially one with no specific domain, a comprehensive lexicon is prohibitive. Thus, it is important that parsers have the ability to cope with new words.</Paragraph>
<Paragraph position="1"> There are two broad approaches to handling unknown words. The first approach is to attempt to construct a complete lexicon, then deal with unknown words in a rudimentary way -- for example, rejecting the input or interacting with the user to obtain the needed information about the unknown word. The second is to attempt to analyze the word at the time of encounter with as little human interaction as possible. This would allow the parser to parse sentences containing unknown words in a robust and autonomous fashion. Unknown words could be learned by discovering their part of speech and feature information during parsing and storing that information in the lexicon. Thus, if the word is encountered again later, it would already be in the lexicon.</Paragraph>
<Paragraph position="2"> Before examining the problem more fully, it is useful to consider work that has already been done on it. There have been several attempts to study the problem of learning unknown words. These attempts have followed several different methodologies and have focused on various aspects of the unknown words.</Paragraph>
<Paragraph position="3"> Statistical methods are most commonly used in part-of-speech tagging. Charniak's paper [5] outlines the use of statistical equations in part-of-speech tagging. Tagging systems make only limited use of the syntactic knowledge inherent in the sentence, in contrast to parsers. An n-gram tagger concentrates on the n neighbors of a word (where n tends to be 2 or 3), ignoring the global sentence structure. Also, many part-of-speech tagging systems are only concerned with resolving ambiguity, not with dealing with unknown words.</Paragraph>
<Paragraph position="4"> Kupiec [8] and Brill [3] make use of morphology to handle unknown words during part-of-speech tagging. Brill's tagger begins by tagging unknown words as proper nouns if capitalized and as common nouns if not. The tagger then learns various transformational rules by training on a tagged corpus, and it applies these rules to unknown words to tag them with the appropriate part-of-speech information. Kupiec's hidden Markov model uses a set of suffixes to assign probabilities and state transformations to unknown words. Both these methods work well, but they ignore the global syntactic content of the sentence. We will examine the effects of combining morphology and syntax, while using a deterministic system to perform parsing.</Paragraph>
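The capitalization heuristic that seeds Brill's tagger can be stated directly in code. Below is a minimal sketch in Common Lisp (the language our own parser is implemented in); the function name is a hypothetical choice, and the learned transformation rules that follow this initial guess in the real tagger are not modeled here.

```lisp
;; Initial guess for an unknown word, per the description above:
;; proper noun if capitalized, common noun otherwise. Assumes a
;; non-empty word; Brill's learned transformation rules would then
;; revise this guess in context.
(defun initial-tag (word)
  (if (upper-case-p (char word 0))
      'proper-noun
      'common-noun))
```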
<Paragraph position="5"> Weischedel, et al. [15] study the effectiveness of probabilistic methods for part-of-speech tagging with unknown words. They show that using morphological information can increase the accuracy of their tagger on unknown words by a factor of five. They also briefly address the effect of unknown words in parsing with their part-of-speech tagger. However, all of their results assume that only one unknown word is present in each sentence or is within the tri-tag range of the tagger. They also assume that automatic disambiguation will eliminate extraneous parses.</Paragraph>
<Paragraph position="6"> These assumptions will not always hold while parsing many corpora.</Paragraph>
<Paragraph position="7"> In his thesis, Tomita [13] mentions the impact of syntax on determining the part of speech of unknown words during parsing. He suggests that his system can handle unknown words by simply assigning them all possible parts of speech (without using any morphological analysis of the words). He performs no experiment to assess his method's viability, but we will demonstrate that this is not a good approach. The use of all possible parts of speech will cause an exponential increase in the number of parses for a sentence as the number of unknown words increases: if each unknown word may take any of k parts of speech, a sentence with n unknown words admits up to k^n candidate category assignments before parsing even begins.</Paragraph>
<Paragraph position="8"> The FOUL-UP system [6], by Granger, is an example of a method that focuses on the use of context. This method assumes that most words are known, and that all sentences lie in a common semantic domain. In FOUL-UP, a strong top-down context in the form of a script is needed to provide the expected attributes of unknown words. The required use of a script to provide this information limits the applicability of the method to situations where scripts are available.</Paragraph>
<Paragraph position="9"> Jacobs and Zernik use a combination of methods in the SCISOR system [7]. Though this system performs morphological and syntactic analysis, SCISOR was designed to be used in a single domain. When new words are discovered, they tend to be given specialized meanings related to the semantic domain, limiting the system to that specific domain.</Paragraph>
<Paragraph position="10"> There is a need for a parsing system that can act over less precisely defined domains and still efficiently cope with unknown words. We feel that morphological recognition, a small lexicon of closed-class words, and a dictionary of known open-class words can help our parser determine the parts of speech of unknown words as they occur. For this research, we define a word by its parts of speech and a small set of features. These features include:</Paragraph> </Section>
<Section position="5" start_page="262" end_page="265" type="metho"> <SectionTitle> 3 Techniques for Word Analysis </SectionTitle>
<Paragraph position="0"> This section will cover the various concepts used by our parser to help in the processing of unknown words. First, the importance of closed-class words will be discussed. Second, the morphological recognition system is detailed. Third, the usefulness of syntactic knowledge is explained.
Finally, the post-mortem method, which integrates these concepts into a parsing system, is described.</Paragraph>
<Section position="1" start_page="262" end_page="263" type="sub_section"> <SectionTitle> 3.1 Closed-Class versus Open-Class Words </SectionTitle>
<Paragraph position="0"> An important source of information used in this experiment is the distinction between closed-class and open-class words. This distinction can be used to develop a small core dictionary of closed-class words that can greatly ease the task of processing unknown words in a sentence.</Paragraph>
<Paragraph position="1"> Closed-class parts of speech are those parts of speech that may not normally be assigned to new words. Closed-class words are words with a closed-class part of speech. For example, pronouns and determiners are members of the closed-class set of words; it is very rare that a new determiner or pronoun is added to the language. Closed-class parts of speech are such that all the words with that part of speech can be enumerated completely. In addition, these closed-class words are not generally used as other parts of speech. This research does not consider the meta-linguistic use of a word, as in this sentence: &quot;'The' is a determiner.&quot;</Paragraph>
<Paragraph position="2"> There are a number of closed-class parts of speech, including determiners, prepositions, conjunctions, predeterminers, and quantifiers. Pronouns and auxiliary verbs (be, have, and do) are also closed-class parts of speech. In addition, we designate irregular verbs as closed-class words for this research, since they are a static part of the language. The irregular verbs are enumerated by Quirk, et al. [12]. Typically, new verbs in a language are not coined as irregular verbs. The enumeration of irregular verbs allows the recognition of unknown verb forms to be rule-based: for example, all past tense regular verbs end in -ed, third person singular regular verbs end in -s, and so on. For similar reasons, irregular noun plurals are included in the set of closed-class words.</Paragraph>
<Paragraph position="3"> Open-class parts of speech are those parts of speech that accept the addition of new items with little difficulty. Open-class words are words with the following parts of speech: nouns, verbs, adjectives, and adverbs. Noun modifier is a fifth part of speech used in this research to indicate those words that can modify nouns; this eliminates the extraneous parses that occur when a word defined as both a noun and an adjective is used to modify a head noun. This part of speech is also used by Cardie [4] in her experiments.</Paragraph>
<Paragraph position="4"> The existence of a set of closed-class words allows the construction of a dictionary in such a way as to facilitate the detection and analysis of unknown words. Because the dictionary contains all the closed-class words, some words in any sentence will very likely be known.</Paragraph>
<Paragraph position="5"> A limit is also placed on the possible parts of speech of unknown words. They cannot have a closed-class part of speech, since the closed-class words are enumerated in the dictionary. So unknown words are assumed to be open-class, restricting them to noun, verb, adjective, adverb, or noun modifier.</Paragraph>
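The lookup policy this implies is small enough to sketch directly. Below is a minimal Common Lisp sketch with toy entries; the names *closed-class-lexicon*, *open-class-tags*, and candidate-tags are illustrative choices, not the system's actual identifiers.

```lisp
;; A handful of toy closed-class entries; the real lexicon also
;; enumerates irregular verbs and irregular noun plurals.
(defparameter *closed-class-lexicon*
  '(("the" determiner) ("she" pronoun) ("of" preposition)
    ("and" conjunction) ("be" auxiliary-verb)))

;; The five open-class categories available to unknown words.
(defparameter *open-class-tags*
  '(noun verb adjective adverb noun-modifier))

(defun candidate-tags (word open-class-dictionary)
  "Possible parts of speech for WORD. A word absent from both the
closed-class lexicon and the open-class dictionary is unknown, and is
restricted to the five open-class categories."
  (or (cdr (assoc word *closed-class-lexicon* :test #'string-equal))
      (cdr (assoc word open-class-dictionary :test #'string-equal))
      *open-class-tags*))
```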
</Section>
<Section position="2" start_page="263" end_page="264" type="sub_section"> <SectionTitle> 3.2 Morphological Recognition </SectionTitle>
<Paragraph position="0"> For convenience, we split the field of morphology into three different areas -- morphological generation, morphological reconstruction, and morphological recognition. Morphological generation research examines the ability of morphological affixation rules to generate new words from a lexicon of base roots. Theoretical linguists and psychologists are interested in morphological generation for its use in linguistic theory or in understanding how people learn a language. For example, Muysken [11] studied the classification of affixes according to their order of application for a theoretical discussion on morphology. Badecker and Caramazza [2] discussed the distinction between inflectional and derivational morphology as it applies to acquired language deficit disorder and, in general, to the theory of language learning. Baayen and Lieber [1] studied the productivity of certain English affixes in the CELEX lexical database, in an effort to separate frequency of appearance from productivity. Viegas, et al. [14] show that the use of lexical rules and morphological generation can greatly aid in the task of lexical acquisition. However, morphological generation involves the construction of new word forms by applying rules of affixation to base forms, and so it is only indirectly helpful in the analysis of unknown words.</Paragraph>
<Paragraph position="1"> Morphological reconstruction researchers process an unknown word by using knowledge of the root stem and affixes of that word. For example, Milne [10] makes use of morphological reconstruction to resolve lexical ambiguity while parsing. Light [9] uses morphological cues to determine semantic features of words by using various restrictions and knowledge sources.</Paragraph>
<Paragraph position="2"> Jacobs and Zernik [7] make use of morphology in their case study of lexical acquisition, in which they attempt to augment their lexicon using a variety of knowledge sources. However, they assume that a large dictionary of general roots is available, and that the unknown words tend to have specialized meanings. Morphological reconstruction research relies on the presence of stem information, making morphological reconstruction of less value for coping with unknown words.</Paragraph>
<Paragraph position="5"> If an unknown word is encountered, the root of that word is likely to also be unknown. A method to cope with unknown words cannot be based on knowledge of the root if the root is also unknown.</Paragraph>
<Paragraph position="6"> Morphological recognition uses knowledge about affixes to determine the possible parts of speech and other features of a word, without utilizing any direct information about the word's stem. There has been little research done in this area. As noted above, many of the uses of morphology for the analysis of lexical items require more knowledge than may be available for some applications, especially if the system uses a limited lexicon but is expected to cope with words that are in the language but not in the lexicon.</Paragraph>
<Paragraph position="7"> There is one caveat concerning the use of only affix information in a morphological recognizer.
Since the root of an unknown word is assumed to be unknown, the recognizer can only consider whether an affix matches the word. This can lead to an interesting type of error. For example, while checking the word butterfly, the affix -ly matches as a suffix. Typically, the -ly affix is attached to adjectives to form adverbs (e.g., happy → happily), or to nouns to form adjectives (e.g., beast → beastly); however, the word butterfly was not formed by this process, but rather by compounding the words butter and fly. Without the knowledge that *butterf is not an acceptable root word, and without some notion of legal word structure, there is no way to determine that butterfly was not formed by applying -ly. So in this case, butterfly is mistakenly assumed to have the -ly suffix. By using additional information, a morphological recognizer could circumvent this problem. However, we still believe that a simple morphological recognition module can be useful.</Paragraph>
<Paragraph position="8"> The question is: how effective can it be? For the morphological recognition module in this system, we constructed a list of suffixes and prefixes by hand, using lists found in Quirk, et al. [12]. Of the two, suffixes generally offer more effective constraints on the possible parts of speech of a word. For each of these affixes, we constructed two lists of parts of speech. The first-choice list contains those parts of speech that are most likely to be found in words ending (suffix) or beginning (prefix) with that affix.</Paragraph>
<Paragraph position="9"> The second-choice list contains those parts of speech that are fairly likely to be found in words ending or beginning with that affix. The first-choice list is a subset of the second-choice list.</Paragraph>
<Paragraph position="10"> Both lists for each affix were created by hand, using rules described in Quirk, et al. [12]. The creation of two separate lists is used by the post-mortem parsing approach of our experimental system, and the use of the two lists will be detailed in that section.</Paragraph> </Section>
<Section position="3" start_page="264" end_page="265" type="sub_section"> <SectionTitle> 3.3 Syntactic Knowledge </SectionTitle>
<Paragraph position="0"> Syntactic knowledge is used implicitly in this research by the parser. When an unknown word is encountered in a sentence, its context in that sentence can be of great importance when predicting its part of speech. For example, a word directly following a determiner is typically a noun or noun modifier. Consider the unknown word smolked used in this sentence: The cat smolked the dog.</Paragraph>
<Paragraph position="1"> Assuming that all the other words in the sentence are in the lexicon, then on purely syntactic grounds the unknown word must be a finite verb, either past or present tense. However, the use of morphological recognition can refine this information. The -ed ending on smolked indicates either a past tense verb or a past participle. By combining the syntactic and morphological information, the word smolked is identified as a past tense verb. Further, it is a verb which takes a noun phrase as its object. Thus, the syntactic information can augment the morphological information, and vice versa. Obviously, the more words in a sentence that are defined in the lexicon, the more the syntactic knowledge can limit the possible parts of speech of unknown words.</Paragraph> </Section>
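To make this interplay concrete, here is a minimal Common Lisp sketch of suffix-only recognition with first- and second-choice lists, and of intersecting its output with syntactic expectations, as in the smolked example. All names and table entries are illustrative; they are not the system's actual affix lists.

```lisp
;; Pure surface matching: no stem knowledge is consulted, which is why
;; "butterfly" matches -ly just as "happily" does (the caveat above).
(defun has-suffix-p (word suffix)
  (let ((lw (length word)) (ls (length suffix)))
    (and (> lw ls)
         (string-equal suffix word :start2 (- lw ls)))))

;; Illustrative entries only: each suffix carries a narrow first-choice
;; list and a broader second-choice superset.
(defparameter *suffix-table*
  '(("ly" :first (adverb)
          :second (adverb adjective noun-modifier))
    ("ed" :first (past-tense-verb past-participle)
          :second (past-tense-verb past-participle adjective))))

;; Repeated from the earlier sketch, for self-containment.
(defparameter *open-class-tags*
  '(noun verb adjective adverb noun-modifier))

(defun morph-candidates (word &key (choice :first))
  "Parts of speech suggested by WORD's suffix; all open-class tags when
no suffix in the table matches."
  (loop for (suffix . plist) in *suffix-table*
        when (has-suffix-p word suffix)
          return (getf plist choice)
        finally (return *open-class-tags*)))

;; "The cat smolked the dog": syntax allows only a finite verb in that
;; slot; morphology narrows -ed words to past tense verb or past
;; participle; the intersection pins down the past tense reading.
;; (intersection '(past-tense-verb present-tense-verb)
;;               (morph-candidates "smolked"))
;; => (PAST-TENSE-VERB)
```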
<Section position="4" start_page="265" end_page="265" type="sub_section"> <SectionTitle> 3.4 Sentence Parsing with Post-Mortem Analysis </SectionTitle>
<Paragraph position="0"> The information sources in the preceding three sections are combined in our experimental post-mortem parsing system. The following five-pass approach is used to allow our parser to cope with unknown words in a sentence (a compact sketch of the control flow is given after the passes):</Paragraph> </Section> </Section>
<Section position="6" start_page="265" end_page="265" type="metho"> <SectionTitle> 1. COMBINED LEXICON AND FIRST-CHOICE MORPHOLOGICAL RECOGNIZER </SectionTitle>
<Paragraph position="0"> The lexicon is consulted for each word in the sentence. If the word is defined in the lexicon, its definition -- consisting of the word, its part of speech, and various features affecting its use -- is used to parse the sentence. If it is not in the lexicon, we assume that it can only have an open-class part of speech, and that it could be any of them. In this pass, we use the morphological recognizer to reduce the number of possible parts of speech for the word. Consulting a list of affixes, the recognizer determines which affixes, if any, are present in the word. Then the recognizer assigns all the parts of speech from the first-choice list to that word. For example, an unknown word ending in -ly is assumed to be an adverb. A parse forest for the sentence is generated. If the parse fails, the parser moves to pass two.</Paragraph> </Section>
<Section position="7" start_page="265" end_page="267" type="metho"> <SectionTitle> 2. COMBINED LEXICON AND SECOND-CHOICE MORPHOLOGICAL RECOGNIZER </SectionTitle>
<Paragraph position="0"> If the sentence fails its first attempt to parse, then it is reparsed using the second-choice list for the affix, instead of the first-choice list. This assigns a more liberal set of parts of speech to the word based on its affix. For example, an unknown word ending in -ly is now assumed to be an adverb, adjective, and modifier. Again, the parse forest for the sentence is returned. If the sentence fails to parse, the parser goes on to pass three. 3. ALL OPEN-CLASS VARIANTS FOR UNKNOWN WORDS If the sentence fails its second parse attempt, it is reparsed assigning all open-class lexical categories to every unknown word. For example, an unknown word ending in -ly is now assumed to be a noun, verb, adverb, adjective, and modifier. The parse forest for the sentence is returned. If the sentence fails to parse in this pass, the parser moves on to pass four.</Paragraph>
<Paragraph position="1"> 4. ALL MORPHOLOGICAL VARIANTS FOR ALL OPEN-CLASS WORDS If the sentence fails its third parse attempt, it is reparsed using the morphological recognizer on all open-class words. It assigns a set of parts of speech to each word, again based on the second-choice list for that word's affixes. Note that only the closed-class lexicon is consulted during this attempt to parse. The morphological recognizer is used to determine definitions for all open-class words. This approach allows the parsing system to find new definitions for words that are already in the dictionary. For example, any word ending in -ly is now assumed to be an adverb, an adjective, and a modifier. The parse forest for the sentence is returned. If the sentence fails to parse in this pass, the parser moves on to pass five.</Paragraph>
<Paragraph position="2"> 5. ALL OPEN-CLASS VARIANTS If the sentence still fails to parse, all possible open-class parts of speech are given to all open-class words in the sentence. Again, only the closed-class lexicon is consulted. For example, any word ending in -ly is now assumed to be a noun, verb, modifier, adjective, and adverb. The parse forest is returned, or the sentence fails completely.</Paragraph>
<Paragraph position="3"> This system is a post-mortem error handling technique -- if (and only if) the sentence fails to parse, the parser tries again, using a more liberal interpretation in its word look-up algorithm. The idea is to limit the possibilities in the beginning to those that are most likely, and to broaden the search space later if the first methods fail. The use of this approach will limit the possible parts of speech for many unknown words.</Paragraph>
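The control flow of the five passes reduces to a single fallback loop. A minimal Common Lisp sketch follows, where parse-with is a hypothetical stand-in for the real parser invoked under a given tag-assignment policy; the policy keywords are our own labels for the five passes described above.

```lisp
;; Hypothetical stand-in: the real system assigns candidate tags to
;; each word according to POLICY, runs the LR parser, and returns a
;; packed parse forest, or NIL on failure.
(defun parse-with (sentence policy)
  (declare (ignore sentence policy))
  nil)

(defun post-mortem-parse (sentence)
  "Try successively more liberal tag assignments; return the first
non-empty parse forest, or NIL when all five passes fail."
  (loop for policy in '(:unknown-first-choice    ; pass 1
                        :unknown-second-choice   ; pass 2
                        :unknown-all-open-class  ; pass 3
                        :all-second-choice       ; pass 4
                        :all-open-class)         ; pass 5
        thereis (parse-with sentence policy)))
```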
<Paragraph position="4"> 4 Description of Experiment
The parsing system used in this experiment is based on a Tomita-style parser [13]. It is an LR-type parser, using a shift-reduce algorithm with packed parse forests. It is implemented in Common Lisp. The test corpus is a set of 356 sentences from the TIMIT corpus [16], a corpus of sentences that has no underlying semantic theme. It was originally designed as a corpus for training phoneme recognition for speech processing. This corpus was specifically chosen for our tests because it has no theme, and would thus offer a wider range of sentence types. The rule set was designed first to properly parse the test corpus, and second to be as general as possible. It is hoped that these rules could deal with a wide variety of English sentences, though they have not been tested on other corpora. The issue of the size of the grammar is not addressed in this experiment. Also, the possible failure of the parser due to insufficient rule coverage is not considered. There are many applications that use grammars that do not cover an extensive range of English sentences, and these applications would benefit from our mechanisms for dealing with unknown words.</Paragraph>
<Paragraph position="5"> There are four separate files that constitute the lexicon for this parser and corpus: three files of closed-class words, and dict10, which contains all the words in the corpus that are not contained in the other three files and are thus all open-class words. Dict9 contains 90% of the words from dict10, dict8 contains 80%, and so on down to dict1, which contains 10% of these words. All these files were created randomly and independently using the words in dict10. These percentages are based on a word count, not on a definition count -- many words have more than one definition. For example, acts is one word, but it has two definitions -- as a noun and a verb.</Paragraph>
<Paragraph position="6"> For our experiment, we perform two data runs. The first run uses a variation of Tomita's method; it assigns all possible open-class parts of speech (noun, verb, adjective, adverb, and modifier) to unknown words. This data run is called the baseline run. The second run assigns parts of speech using the post-mortem approach described in Section 3.4. This data run is called the experimental run.</Paragraph>
<Paragraph position="7"> For each test run, the sentences in the corpus are parsed by the system eleven times. Each time through the corpus, all the closed-class words are loaded into the lexicon. In each separate run, a different open-class dictionary is used. For the first run, the full dictionary found in dict10 is used. This run is used as a control, since all words in the corpus are defined in the lexicon. For each successive run, the next dictionary is used, from dict9 down to dict1. Finally, the eleventh run is done without loading any extra dictionary files -- only the three closed-class files are used.</Paragraph>
<Paragraph position="8"> After all of the parse trees have been generated, each run is compared to the control run, and three numbers are calculated for each sentence in each run -- the number of matches, deletions, and insertions. A match occurs when the sentence has generated a parse that occurs within the control parse forest for that sentence. A deletion occurs when the sentence has failed to produce a parse that occurs in the control parse forest for that sentence. An insertion occurs when the sentence produces a parse that does not occur in the control parse forest for that sentence. For example, assume sentence #16 produces parses A, B, C, and D in the first, or control, run. In a later run, sentence #16 produces parses A, C, E, F, and G. Then, for sentence #16, we have two matches (A and C), two deletions (B and D), and three insertions (E, F, and G). By using these measurements, we can determine the precision and recall of the parsing system when parsing sentences with unknown words.</Paragraph>
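These per-sentence counts are simple set comparisons over parse forests. A minimal Common Lisp sketch follows, assuming parses are represented so that EQUAL can compare them; score-sentence is an illustrative name, and the precision and recall formulas in the comments are one natural reading of how the three counts combine, not formulas stated in the paper.

```lisp
(defun score-sentence (control-forest run-forest)
  "Return (MATCHES DELETIONS INSERTIONS) for one sentence: parses in
both forests, parses only in the control forest, and parses only in
this run's forest, respectively."
  (list (length (intersection run-forest control-forest :test #'equal))
        (length (set-difference control-forest run-forest :test #'equal))
        (length (set-difference run-forest control-forest :test #'equal))))

;; The sentence #16 example from the text:
;; (score-sentence '(a b c d)      ; control run: parses A B C D
;;                 '(a c e f g))   ; later run:   parses A C E F G
;; => (2 2 3)                      ; 2 matches, 2 deletions, 3 insertions
;;
;; Precision = matches / (matches + insertions) = 2/5
;; Recall    = matches / (matches + deletions)  = 2/4
```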
<Paragraph position="9"> The issue of disambiguation is not explicitly dealt with in this experiment. We wish to see how well the morphological recognizer can replicate the performance of a parser with a full dictionary. This is demonstrated in our use of the match, insertion, and deletion numbers above. The number of matches is important, since ideally we want the recognizer to return all possible parses that occur when the full dictionary is used. The issue of which of these parses is the correct one would require that we utilize semantic, pragmatic, and contextual information to select the correct parse, a topic beyond the scope of our experiment.</Paragraph> </Section> </Paper>