<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1036">
  <Title>A Part of Speech Estimation Method for Japanese Unknown Words using a Statistical Model of Morphology and Context</Title>
  <Section position="8" start_page="282" end_page="283" type="relat">
    <SectionTitle>
5 Related Work
</SectionTitle>
    <Paragraph position="0"> Since English uses spaces between words, unknown words can be identified by simple dictionary lookup.</Paragraph>
    <Paragraph position="1"> The topic of interest is therefore part of speech estimation.</Paragraph>
    <Paragraph position="2"> Several statistical models that estimate the part of speech of unknown words from the case of the first letter and from the prefix and suffix have been proposed (Weischedel et al., 1993; Brill, 1995; Ratnaparkhi, 1996; Mikheev, 1997). In contrast, since Asian languages such as Japanese and Chinese do not put spaces between words, previous work on the unknown word problem has focused on word segmentation; there are few studies that estimate the part of speech of unknown words in Asian languages.</Paragraph>
    <Paragraph position="3"> The cues used in this paper for estimating the part of speech of unknown words in Japanese are basically the same as those for English, namely, the prefix and suffix of the unknown word as well as the preceding and following parts of speech. The contribution of this paper is in showing that different character sets behave differently in Japanese and that a better word model can be built by exploiting this fact.</Paragraph>
    <Paragraph position="4"> By introducing different length models based on character sets, the number of decomposition errors of unknown words is significantly reduced. In other words, the tendency toward over-segmentation is corrected. However, the spelling model, especially the character bigrams in Equation (17), is hard to estimate because of data sparseness. This is the main reason for the remaining under-segmentation and over-segmentation errors.</Paragraph>
    <Paragraph position="5"> To improve the unknown word model, a feature-based approach such as the maximum entropy method (Ratnaparkhi, 1996) might be useful, because we would not have to divide the training data into several disjoint sets (as we did by part of speech and word type) and we could incorporate more linguistic and morphological knowledge into the same probabilistic framework. We plan to reimplement our unknown word model using the maximum entropy method as the next step of our research.</Paragraph>
  </Section>
</Paper>