<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0107"> <Title>Reestimation and Best-First Parsing Algorithm for Probabilistic Dependency Grammars</Title> <Section position="2" start_page="0" end_page="41" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> There have been many efforts to induce grammars automatically from corpora, exploiting the vast amounts of text available with various degrees of annotation. Corpus-based, stochastic grammar induction has many advantages, such as simple acquisition and extension of linguistic knowledge, easy treatment of ambiguities by virtue of its innate scoring mechanism, and fail-soft reaction to ill-formed or extra-grammatical sentences.</Paragraph> <Paragraph position="1"> Most corpus-based grammar induction has concentrated on phrase structure grammars (Black, Lafferty, and Roukos, 1992; Lari and Young, 1990; Magerman, 1994). The typical procedure for phrase structure grammar induction is as follows (Lari and Young, 1990; Carroll, 1992b): (1) generate all the possible rules, (2) reestimate the probabilities of the rules using the inside-outside algorithm, and (3) finally find a stable grammar by eliminating the rules whose probabilities are close to 0. Generating all the rules is done by restricting the number of nonterminals and/or the number of right-hand-side symbols in the rules and enumerating all the possible combinations. Chen extracts rules by heuristics and reestimates their probabilities with the inside-outside algorithm (Chen, 1995). The inside-outside algorithm learns a grammar by iteratively adjusting the rule probabilities to minimize the entropy of the training corpus, and it is extensively used as a reestimation algorithm for phrase structure grammars.</Paragraph> <Paragraph position="2"> Most of the work on phrase structure grammar induction, however, has succeeded only partially.
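The three-step induction procedure described above (enumerate all rules, reestimate with inside-outside, prune near-zero rules) can be sketched as follows. This is an illustrative toy implementation only, not the cited systems' code: the nonterminal set, terminal set, corpus, iteration count, and pruning threshold are all invented for the example, and the grammar is restricted to Chomsky normal form.

```python
# Toy sketch of inside-outside grammar induction (invented example).
from collections import defaultdict

NONTERMS = ["S", "X"]
TERMS = ["a", "b"]
CORPUS = [["a", "b"], ["a", "a", "b"]]

def all_rules():
    # Step (1): enumerate every possible CNF rule over the symbol sets,
    # with uniform initial probabilities per left-hand side.
    binary = [(a, (b, c)) for a in NONTERMS for b in NONTERMS for c in NONTERMS]
    unary = [(a, (t,)) for a in NONTERMS for t in TERMS]
    per_lhs = len(NONTERMS) ** 2 + len(TERMS)
    return {(lhs, rhs): 1.0 / per_lhs for lhs, rhs in binary + unary}

def inside(w, rules):
    # beta[(i, j, A)] = probability that A derives the span w[i..j].
    n = len(w)
    beta = defaultdict(float)
    for i in range(n):
        for a in NONTERMS:
            beta[(i, i, a)] = rules.get((a, (w[i],)), 0.0)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for a in NONTERMS:
                total = 0.0
                for k in range(i, j):
                    for b in NONTERMS:
                        for c in NONTERMS:
                            total += (rules.get((a, (b, c)), 0.0)
                                      * beta[(i, k, b)] * beta[(k + 1, j, c)])
                beta[(i, j, a)] = total
    return beta

def outside(w, rules, beta):
    # alpha[(i, j, A)] = probability of everything outside the span.
    n = len(w)
    alpha = defaultdict(float)
    alpha[(0, n - 1, "S")] = 1.0
    for span in range(n, 1, -1):
        for i in range(n - span + 1):
            j = i + span - 1
            for a in NONTERMS:
                out = alpha[(i, j, a)]
                if out == 0.0:
                    continue
                for k in range(i, j):
                    for b in NONTERMS:
                        for c in NONTERMS:
                            p = rules.get((a, (b, c)), 0.0)
                            alpha[(i, k, b)] += out * p * beta[(k + 1, j, c)]
                            alpha[(k + 1, j, c)] += out * p * beta[(i, k, b)]
    return alpha

def reestimate(rules, corpus, iters=20, prune=1e-3):
    # Step (2): EM reestimation from expected rule counts.
    for _ in range(iters):
        counts = defaultdict(float)
        for w in corpus:
            n = len(w)
            beta = inside(w, rules)
            alpha = outside(w, rules, beta)
            z = beta[(0, n - 1, "S")]
            if z == 0.0:
                continue
            for (lhs, rhs), p in rules.items():
                if len(rhs) == 2:
                    b, c = rhs
                    for i in range(n):
                        for j in range(i + 1, n):
                            for k in range(i, j):
                                counts[(lhs, rhs)] += (alpha[(i, j, lhs)] * p
                                    * beta[(i, k, b)] * beta[(k + 1, j, c)] / z)
                else:
                    for i in range(n):
                        if w[i] == rhs[0]:
                            counts[(lhs, rhs)] += alpha[(i, i, lhs)] * p / z
        totals = defaultdict(float)
        for (lhs, rhs), c in counts.items():
            totals[lhs] += c
        rules = {r: c / totals[r[0]] for r, c in counts.items()
                 if totals[r[0]] > 0}
    # Step (3): keep only rules whose probability stays away from zero.
    return {r: p for r, p in rules.items() if p > prune}
```

Here the grammar is small enough that full rule enumeration is trivial; the text's point is that real induction must bound the nonterminal and right-hand-side inventories precisely because this enumeration explodes otherwise.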
Estimating phrase structure grammars by minimizing the training corpus entropy does not lead to grammars that are consistent with human intuitions (de Marcken, 1995). To increase the correctness of the learned grammar, de Marcken proposed including lexical information in the phrase structure grammar. A recent trend in parsing is also to include lexical information to increase correctness (Magerman, 1994; Collins, 1996). This means that the lack of lexical information in phrase structure grammars is a major weak point for syntactic disambiguation. Besides the lack of lexical information, the induction of phrase structure grammars may suffer from structural data sparseness with a medium-sized training corpus; structural data sparseness means a lack of information on the grammar rules. One approach to increasing the correctness of grammar induction is to learn the grammar from a tree-tagged or bracketed corpus (Pereira and Schabes, 1992; Black, Lafferty, and Roukos, 1992). But the construction of a vast tree-tagged or bracketed corpus is very labour-intensive, and manual construction of such a corpus may produce serious inconsistencies. And the structural data sparseness problem still remains.</Paragraph> <Paragraph position="3"> The problems of structural data sparseness and lack of lexical information can be lessened with probabilistic dependency grammars (PDGs). A dependency grammar defines a language as a set of dependency relations between pairs of words. The dependency relations, the basic units of sentence structure in dependency grammar, are much simpler than the rules of a phrase structure grammar. So the search space of dependency grammar induction may be smaller, and the induction may be less affected by structural data sparseness. Dependency grammar induction has been studied by Carroll (Carroll, 1992b; Carroll, 1992a). In these works, however, the dependency grammar was rather a restricted form of phrase structure grammar.
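The contrast drawn above can be made concrete: a dependency grammar's parameters are simply probabilities attached to head-dependent word pairs, and a parse is a set of such relations. The following minimal sketch uses an invented three-word example and invented probability values, not the paper's model or data.

```python
# Minimal sketch (invented example): a probabilistic dependency grammar
# as a table over (head, dependent) word pairs, and a parse scored as
# the product of its dependency-relation probabilities. Note the basic
# unit is a word pair, simpler than a phrase-structure rule.
from math import prod

LINK_PROB = {
    ("saw", "I"): 0.4,    # subject attaches to the verb
    ("saw", "dog"): 0.3,  # object attaches to the verb
    ("dog", "the"): 0.8,  # determiner attaches to the noun
}

def parse_prob(links):
    """Score a dependency tree given as (head, dependent) pairs."""
    return prod(LINK_PROB.get(link, 0.0) for link in links)

tree = [("saw", "I"), ("saw", "dog"), ("dog", "the")]
print(parse_prob(tree))  # 0.4 * 0.3 * 0.8
```

Because the parameter space is a table of word pairs rather than a set of production rules, the same amount of training data covers proportionally more of it, which is the sparseness argument made in the text.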
Accordingly, they extensively used the inside-outside algorithm to reestimate the grammar and thus faced the same problem of structural data sparseness.</Paragraph> <Paragraph position="4"> In this paper, we propose a reestimation algorithm and a best-first parsing algorithm for PDGs. The reestimation algorithm is a variation of the inside-outside algorithm adapted to PDGs. The inside-outside algorithm is a probabilistic parameter reestimation algorithm for phrase structure grammars in Chomsky normal form, and thus cannot be used directly for the reestimation of probabilistic dependency grammars. We define non-constituent objects, the complete-link and the complete-sequence, as the basic units of dependency structure. Both the reestimation algorithm and the best-first parsing algorithm utilize a CYK-style chart with the non-constituent objects as chart entries. Both algorithms have O(n³) time complexity.</Paragraph> <Paragraph position="5"> The rest of the paper is organized as follows. Section 2 defines the basic units and describes the best-first parsing algorithm. Section 3 describes the reestimation algorithm. Section 4 shows the experimental results of the reestimation algorithm on Korean, and finally Section 5 concludes the paper.</Paragraph> </Section></Paper>