<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-2001">
  <Title>Hybrid Methods for POS Guessing of Chinese Unknown Words</Title>
  <Section position="10" start_page="4" end_page="4" type="concl">
    <SectionTitle>
6 Discussion and Conclusion
</SectionTitle>
    <Paragraph position="0"> The results indicate that the three models have different strengths and weaknesses. Using rules that do not overgenerate and that are sensitive to the type, length, and internal structure of unknown words, the rule-based model achieves high precision for all words and high recall for longer words, but recall for disyllabic words is low. The trigram model makes use of the contextual information of unknown words and solves the recall problem, but its precision is relatively low. Wu and Jiang's (2000) model complements the other two, as it achieves a higher recall than the rule-based model and a higher precision than the trigram model for disyllabic words. The combined model outperforms each individual model by effectively combining their strengths.</Paragraph>
    <Paragraph position="1"> The results challenge the reasons given in previous studies for rejecting the rule-based model. Overgeneration is a problem only if one attempts to write rules to cover the complete set of unknown words. It can be controlled if one favors precision over recall. To this end, the internal structure of the unknown words provides very useful information. Results for the rule-based model also suggest that as unknown words become longer and the fluidity of their component words/morphemes decreases, they become more predictable and generalizable by rules.</Paragraph>
    <Paragraph position="2"> The results achieved in this study represent a significant improvement over those reported in previous studies. To our knowledge, the best previous result on this task, 69.13%, was reported by Chen et al. (1997). However, they considered fourteen POS categories, whereas we examined only eight. This difference is brought about by the different tagsets used in the different corpora and the decision to include or exclude proper names and numeric-type compounds. To make the results more comparable, we replicated their model, and the results we found were consistent with what they reported, i.e., 69.12% for our training data and 68.79% for our test data, as opposed to our 89.32% and 89%, respectively.</Paragraph>
    <Paragraph position="3"> Several avenues can be taken for future research. First, it will be useful to identify a statistical model that achieves higher precision for disyllabic words, as this seems to be the bottleneck. It will also be relevant to apply advanced statistical models that can incorporate various kinds of useful information to this task, e.g., the maximum entropy model (Ratnaparkhi, 1996). Second, for better evaluation, it would be helpful to use a larger corpus and evaluate the individual models on a held-out dataset, to compare our model with other models on more comparable datasets, and to test the model on other logographic languages. Third, some grammatical constraints may be used for the detection and correction of tagging errors in a post-processing step. Finally, as part of a bigger project on Chinese unknown word resolution, we would like to see how well the general methodology used and the specifics acquired in this task can benefit the identification and sense-tagging of unknown words.</Paragraph>
  </Section>
</Paper>