<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2027">
<Title>Bayesian Nets in Syntactic Categorization of Novel Words</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 A DBN for PoS Tagging </SectionTitle>
<Paragraph position="0"> Unlike Toutanova [2002], we deliberately base our model on the original feature set of Ratnaparkhi's MaxEnt tagger. Our Bayesian network includes a set of binary features (1-3 below) and a set of vocabulary features (4-6 below). The binary features indicate the presence or absence of a particular kind of character in the token: 1. does the token contain a capital letter; 2. does the token contain a hyphen; 3. does the token contain a number. We used Ratnaparkhi's vocabulary lists to encode the values of 6458 frequent Words, 3602 Prefixes and 2925 Suffixes of up to 4 letters.</Paragraph>
<Paragraph position="1"> A dynamic Bayesian network (DBN) is a Bayesian network unrolled in time, so that it can represent dependencies between variables at adjacent positions (see figure). For a good overview of DBNs, see Murphy [2002]. The set of observable variables in our network consists of the binary and vocabulary features described above. In addition, there are two hidden variables: PoS and Memory, which reflects contextual information about past PoS tags. Unlike Ratnaparkhi, we do not directly consider any information about preceding words, not even the previous one [Toutanova 2002]. However, a special value of Memory indicates whether we are at the beginning of the sentence.</Paragraph>
<Paragraph position="2"> Learning in our model is equivalent to collecting statistics over co-occurrences of feature values and tags. This is implemented in GAWK scripts and takes minutes on the WSJ training corpus; compare this to the laborious Improved Iterative Scaling required for MaxEnt. Tagging is carried out by the standard forward-backward algorithm (see e.g. Murphy [2002]), so we do not need specialized search algorithms such as Ratnaparkhi's "beam search". In addition, our method does not require a "Development" stage.</Paragraph>
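To make the feature set above concrete, here is a minimal sketch of the per-token observable variables, in Python rather than the authors' GAWK; the tiny vocabulary sets and all names are our illustrative stand-ins for Ratnaparkhi's lists, not the paper's code.

```python
# Minimal sketch of the observable variables.  WORDS, PREFIXES and
# SUFFIXES stand in for Ratnaparkhi's lists of 6458 frequent words,
# 3602 prefixes and 2925 suffixes of up to 4 letters.
WORDS = {"the", "company", "said"}
PREFIXES = {"re", "un", "pre"}
SUFFIXES = {"s", "ed", "ing", "tion"}
UNKNOWN = "UNK"

def features(token):
    """Map one token to the network's observable feature values."""
    low = token.lower()
    return {
        # Binary features: presence of a particular kind of character.
        "has_capital": any(c.isupper() for c in token),
        "has_hyphen": "-" in token,
        "has_number": any(c.isdigit() for c in token),
        # Vocabulary features: the word itself and its longest listed
        # prefix/suffix of up to 4 letters, or UNK if none is listed.
        "word": low if low in WORDS else UNKNOWN,
        "prefix": next((low[:i] for i in range(4, 0, -1)
                        if low[:i] in PREFIXES), UNKNOWN),
        "suffix": next((low[-i:] for i in range(4, 0, -1)
                        if low[-i:] in SUFFIXES), UNKNOWN),
    }

print(features("Re-engineered"))
# {'has_capital': True, 'has_hyphen': True, 'has_number': False,
#  'word': 'UNK', 'prefix': 're', 'suffix': 'ed'}
```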
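Learning as co-occurrence counting can then be sketched as follows, continuing the example above; the corpus format, the START marker for Memory's sentence-start value, and the function names are our assumptions.

```python
from collections import Counter, defaultdict

def train(tagged_corpus):
    """Collect co-occurrence statistics over feature values and tags.

    `tagged_corpus` is a (hypothetical) list of sentences, each a
    list of (token, tag) pairs.  The counts define the network's
    conditional probability tables by relative frequency.
    """
    emit = defaultdict(Counter)   # emit[feature][(value, tag)] -> count
    trans = Counter()             # trans[(previous tag, tag)] -> count
    tag_counts = Counter()        # tag_counts[tag] -> count
    for sentence in tagged_corpus:
        prev = "START"            # Memory's sentence-start value
        for token, tag in sentence:
            for name, value in features(token).items():
                emit[name][(value, tag)] += 1
            trans[(prev, tag)] += 1
            tag_counts[tag] += 1
            prev = tag
    return emit, trans, tag_counts
```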
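Tagging with the forward-backward algorithm can be sketched the same way. Since the per-position posteriors are only compared with each other, unnormalized add-alpha-smoothed potentials suffice, and no beam search is needed; a real implementation would work in log space or rescale to avoid underflow, and here a plain tag bigram stands in for the paper's Memory variable.

```python
def tag(sentence, emit, trans, tag_counts, alpha=0.1):
    """Posterior decoding of one sentence via forward-backward."""
    tagset = list(tag_counts)
    n = len(sentence)

    def e(token, t):              # emission potential of tag t
        score = 1.0
        for name, value in features(token).items():
            score *= emit[name][(value, t)] + alpha
        return score

    def tr(prev, t):              # transition potential prev -> t
        return trans[(prev, t)] + alpha

    # Forward pass: fwd[i][t] sums over all tag sequences ending in t.
    fwd = [{t: tr("START", t) * e(sentence[0], t) for t in tagset}]
    for i in range(1, n):
        fwd.append({t: e(sentence[i], t)
                       * sum(fwd[i - 1][p] * tr(p, t) for p in tagset)
                    for t in tagset})
    # Backward pass: bwd[i][t] sums over all continuations after t.
    bwd = [dict.fromkeys(tagset, 1.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = {t: sum(tr(t, q) * e(sentence[i + 1], q) * bwd[i + 1][q]
                         for q in tagset)
                  for t in tagset}
    # The posterior of tag t at position i is proportional to fwd*bwd.
    return [max(tagset, key=lambda t: fwd[i][t] * bwd[i][t])
            for i in range(n)]
```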
<Paragraph position="3"> Following the established data split, we use sections 0-22 of the WSJ corpus for training and the remaining sections (23-24) as a test set. The test sections contain 4792 of the roughly 55600 sentences in the WSJ corpus; the average sentence length is 23 tokens. In addition, we created two specialized test corpora (available upon request for comparison purposes). A small Email corpus was prepared from excerpts of the MUC seminar announcement corpus. "The Jabberwocky" is a poem by Lewis Carroll in which the majority of words are made up but their PoS tags are apparent to speakers of English; we use it to illustrate performance on unknown words. Both the Email corpus and the Jabberwocky were pre-tagged by the Brill tagger and then manually corrected.</Paragraph>
<Paragraph position="4"> We began our experiments by using Ratnaparkhi's original set of features and vocabulary lists for the variables Word, Prefix and Suffix. This produced reasonable performance. While investigating the relative contribution of each feature in this setting, we discovered that removing the three binary features from the feature set does not significantly alter performance. Upon closer examination, the vocabularies we used turned out to contain a great deal of redundant information that is otherwise handled by these features: for example, the Prefix list contained 84 hyphens (e.g. both "co-" and "co"), 530 numbers and 150 capitalized entries, including single capital letters. We therefore proceeded with reduced vocabularies obtained by removing this redundant information from the original lists. The results for various testing conditions are presented in Table 1. Since Toutanova [2002] reports that Prefix information worsens performance, we also conducted a second set of experiments with a network that contained no prefix information; we found no significant change in performance.</Paragraph>
<Paragraph position="5"> Our overall performance is comparable to the best result known on this benchmark (e.g. Toutanova [2002]). At the same time, our error rate on out-of-vocabulary (OoV) words is significantly lower (9.4% versus 13.3%). We attribute this difference to the purer representation of morphologically relevant suffixes in our factored vocabulary, which excludes redundant and therefore potentially confusing information. Another reason may be that our method puts greater emphasis on syntactically relevant facts, such as morphology and tag-sequence information, by refraining from using word-specific cues.</Paragraph>
<Paragraph position="6"> Despite our good performance on the WSJ corpus, we failed to improve on Brill's tagging on our two specialized corpora. Both the Brill tagger and our method achieved 89% on the Jabberwocky poem; note, however, that the Brill tagger uses much more sophisticated mechanisms to obtain this result. It was particularly disappointing that we did not succeed in labeling the Email corpus accurately (an error rate of 16.3%, versus 14.9% for Brill). However, this poor performance appears to be partly related to a labeling convention of the Penn Treebank, which essentially causes most capitalized words to be categorized as NNPs. In our view, there is a significant difference between the grammatical status of a proper name such as "Virginia Savova", where the words cannot be said to modify one another, and the name of an institution such as "Department of Chemical Engineering", where "Chemical" clearly modifies "Engineering". While a rule-based system profits from this simplistic convention, our method is harmed by it.</Paragraph>
</Section>
</Paper>