<?xml version="1.0" standalone="yes"?> <Paper uid="P06-3013"> <Title>Extraction of Tree Adjoining Grammars from a Treebank for Korean</Title> <Section position="3" start_page="0" end_page="73" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> An electronic grammar is an interface between the complexity and diversity of natural language and the regularity and effectiveness of language processing, and it is one of the most important elements of natural language processing.</Paragraph> <Paragraph position="1"> Since traditional manual grammar development is a time-consuming and labor-intensive task, many efforts toward automatic and semi-automatic grammar development have been made over the last decades.</Paragraph> <Paragraph position="2"> Automatic grammar development means that a system extracts a grammar from a Treebank, which contains an implicit Treebank grammar. The grammar extraction system takes syntactically analyzed sentences as input and produces a target grammar.</Paragraph> <Paragraph position="3"> The extracted grammar may be the same as the Treebank grammar or may differ from it, depending on the user's specific purpose. An automatically extracted grammar has the advantages of coherence and rapid development. However, since it always depends on the Treebank that the extraction system uses, its coverage may be limited by the scale of that Treebank. Moreover, a reliable Treebank is hard to find, especially in the public domain.</Paragraph> <Paragraph position="4"> Semi-automatic grammar development means that a system generates the grammar from a description of language-specific syntactic (or linguistic) variations and their constraints. The meta-grammar of Candito (1999) and the tree descriptions of Xia (2001) are good examples of semi-automatic grammar development. 
Even with semi-automatic grammar development, we need a good description of the linguistic phenomena of the specific language, which requires a very high level of linguistic knowledge, and semi-automatically generated grammars can easily suffer from an overflow problem.</Paragraph> <Paragraph position="5"> Since a grammar can be extracted automatically with little effort when a reliable Treebank is available, in this paper we implement a system that extracts a Lexicalized Tree Adjoining Grammar (LTAG) and a Feature-based Lexicalized Tree Adjoining Grammar (FB-LTAG) from the Korean Sejong Treebank (SJTree). SJTree contains 32,054 eojeols (the unit of segmentation in a Korean sentence), that is, 2,526 sentences. SJTree uses 43 part-of-speech tags and 55 syntactic tags.</Paragraph> <Paragraph position="6"> Although there is much previous work on extracting grammars from a Treebank, extracting syntactic features is attempted here for the first time. The 55 full-scale syntactic tags and the well-formed morphological analysis in SJTree allow us to extract syntactic features automatically and to develop an FB-LTAG.</Paragraph> <Paragraph position="7"> First, we briefly present feature structures, focusing on FB-LTAG, and previous work on extracting a grammar from a Treebank. Then, we explain our grammar extraction scheme and report experimental results. Finally, we present our conclusions.</Paragraph> <Paragraph position="8"> 2 Feature structures and previous work on extracting grammars from a Treebank A feature structure is a way of representing grammatical information. Formally, a feature structure consists of a specification of a set of features, each of which is paired with a particular value (Sag et al., 2003). In a unification-based framework, a feature structure is associated with each node in an elementary tree (Vijay-Shanker and Joshi, 1991). This feature structure contains information about how the node interacts with other nodes in the tree. 
It consists of a top part, which generally contains information relating to the super-node, and a bottom part, which generally contains information relating to the sub-node (Han et al., 2000).</Paragraph> <Paragraph position="9"> In FB-LTAG, the feature structure of a new node created by substitution inherits the union of the features of the original nodes. The top feature of the new node is the union of the top features (f</Paragraph> <Paragraph position="11"> of the two original nodes, while the bottom feature of the new node is simply the bottom feature (g ) of the top node of the substituting tree, since the substitution node has no bottom feature, as shown in Figure 1.</Paragraph> <Paragraph position="12"> In adjunction, the node being adjoined into splits: its top feature (f) unifies with the top feature (f ) of the root node of the adjoining tree, while its bottom feature (g) unifies with the bottom feature (g ) of the foot node of the adjoining tree, as shown in Figure 2.</Paragraph> <Paragraph position="13"> Several approaches to grammar extraction, especially for the TAG formalism, have been proposed. Chen (2001) extracted lexicalized grammars from the English Penn Treebank, and other work builds on Chen's procedure, such as Johansen (2004) and Nasr (2004) for French and Habash and Rambow (2004) for Arabic. Chiang (2000) used Tree Insertion Grammars, a variant of the TAG formalism, for his extraction system based on the English Penn Treebank. Xia et al. (2000) developed a uniform method of grammar extraction for English, Chinese and Korean. Neumann (2003) extracted Lexicalized Tree Grammars from the English Penn Treebank for English and from the NEGRA Treebank for German. As mentioned above, none of these works attempted to extract syntactic features for FB-LTAG.</Paragraph> </Section></Paper>