<?xml version="1.0" standalone="yes"?> <Paper uid="P06-3013"> <Title>Extraction of Tree Adjoining Grammars from a Treebank for Korean</Title> <Section position="4" start_page="73" end_page="75" type="metho"> <SectionTitle> 3 Grammar extraction scheme </SectionTitle> <Paragraph position="0"> Before extracting a grammar automatically, we transform each bracketed sentence structure in SJTree into a tree data structure. Then, traversing the tree depth-first, we determine a head and the type of operation (substitution or adjunction) for the child nodes of each non-terminal node.</Paragraph> <Section position="1" start_page="73" end_page="74" type="sub_section"> <SectionTitle> 3.1 Determination of a head </SectionTitle> <Paragraph position="0"> To determine a head, we assume that the right-most child node is the head among its sibling nodes, since Korean is an end-focus language. For instance, in an [NP NP] composition the second NP is marked as the head, while the first NP is marked for adjunction in the extracted grammar G, which uses eojeols directly without modification of SJTree (see Section 4 for the details of the extraction experiments). Likewise, in a [VP@VV VP@VX] composition, where the first VP has a VV (verb) anchor and the last VP has a VX (auxiliary verb) anchor, the principal verb in the first VP is marked for adjunction and the auxiliary verb in the second VP becomes the head; that is, the extracted auxiliary verb tree carries every argument of the whole sentence. 
This phenomenon can be explained by argument composition.</Paragraph> <Paragraph position="1"> The head nodes of the extracted grammar for the verb balpyoha.eoss.da ('announced') in (1) are in boldface in Figure 3, which represents the bracketed sentence structure in SJTree.</Paragraph> <Paragraph position="3"/> </Section> <Section position="2" start_page="74" end_page="74" type="sub_section"> <SectionTitle> 3.2 Distinction between substitution and adjunction operations </SectionTitle> <Paragraph position="0"> Unlike other treebank corpora such as the English Penn Treebank and the French Paris 7 Treebank, SJTree has full-scale syntactic tags that allow us to determine easily which nodes are marked for substitution and which for adjunction. Among the 55 syntactic tags in SJTree, nodes labeled NP (noun phrase), S (sentence), VNP (copular phrase) or VP (verb phrase) whose tags end with _CMP (attribute), _OBJ (object) or _SBJ (subject) are marked for substitution, and nodes labeled with any other syntactic tag, except the head node, are marked for adjunction. Under this distinction, some VNP and VP phrases are marked for substitution, which means that they are arguments of a head; this happens because SJTree labels the nominalized forms of VNP and VP as VNP and VP rather than NP. In Figure 4, for example, the NP_SBJ and NP_OBJ nodes are marked for substitution and the AP node is marked for adjunction.</Paragraph> <Paragraph position="1"> Child nodes marked for substitution are replaced by substitution terminal nodes (e.g. NP_SBJ↓), and the extraction procedure is called recursively on the subtree rooted at the child node itself. Child nodes marked for adjunction are removed from the main tree, and the extraction procedure is likewise called recursively on the subtree, to which we add the parent of the child node as a root node and a sibling node as a foot node (e.g. VP*). 
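The marking rules of Sections 3.1 and 3.2 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used for the experiments: the function name mark_children and the representation of sibling nodes as a plain list of tag strings are hypothetical, and anchor annotations such as VP@VV are not handled.

```python
# Hypothetical sketch of the marking rules in Sections 3.1 and 3.2.
# Assumption: sibling nodes are given as a list of SJTree tag strings,
# ordered left to right; names below are illustrative only.

SUBST_CATEGORIES = {"NP", "S", "VNP", "VP"}   # phrase labels named in the text
SUBST_FUNCTIONS = {"CMP", "OBJ", "SBJ"}       # functional suffixes named in the text


def mark_children(labels):
    """Return (label, role) pairs for the children of one non-terminal node."""
    marks = []
    last = len(labels) - 1
    for i, label in enumerate(labels):
        # Split "NP_SBJ" into category "NP" and function "SBJ".
        category, _, function = label.partition("_")
        if i == last:
            role = "head"            # the right-most child is assumed to be the head
        elif category in SUBST_CATEGORIES and function in SUBST_FUNCTIONS:
            role = "substitution"    # argument: replaced by a substitution node
        else:
            role = "adjunction"      # everything else: extracted as an auxiliary tree
        marks.append((label, role))
    return marks


if __name__ == "__main__":
    print(mark_children(["NP_SBJ", "VP"]))
    print(mark_children(["NP", "NP"]))
```

In this sketch the head test takes precedence over the tag test, mirroring the phrase "except the head node" in the rule above.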
As defined in the TAG formalism, the foot node has the same label as the root node of the subtree for an adjunction operation.</Paragraph> </Section> <Section position="3" start_page="74" end_page="74" type="sub_section"> <SectionTitle> 3.3 Reducing trunk </SectionTitle> <Paragraph position="0"> Grammars extracted as explained above are not always &quot;correct&quot; TAG grammars. Because nodes marked for adjunction are removed, redundant intermediate nodes remain in the main tree.</Paragraph> <Paragraph position="1"> In this case, we remove these redundant nodes.</Paragraph> <Paragraph position="2"> Figure 4 shows how to remove the redundant intermediate nodes from the extracted tree for a verb</Paragraph> <Paragraph position="4"> from extracted trees</Paragraph> </Section> <Section position="4" start_page="74" end_page="75" type="sub_section"> <SectionTitle> 3.4 Extracting features </SectionTitle> <Paragraph position="0"> The 55 full-scale syntactic tags and the morphological analyses in SJTree allow us to extract syntactic features automatically and to develop an FB-LTAG.</Paragraph> <Paragraph position="1"> Automatically extracted FB-LTAG grammars ultimately use a reduced tagset because FB-LTAG grammars carry their syntactic information in feature structures. For example, the NP_SBJ syntactic tag in LTAG is changed into NP and a syntactic feature &lt;case=subject&gt; is added. As a result, we use a reduced tagset of 13 tags for FB-LTAG grammars. From full-scale syntactic tags which end with _SBJ (subject), _OBJ (object) and _CMP (attribute), we extract &lt;case&gt; features, which describe argument structures in the sentence.</Paragraph> <Paragraph position="2"> Alongside &lt;case&gt; features, we also extract &lt;mode&gt; and &lt;tense&gt; from morphological analyses in SJTree. 
However, since the morphological analyses for verbal and adjectival endings in SJTree are simply divided into EP, EF and EC, which denote non-final endings, final endings and conjunctive endings respectively, &lt;mode&gt; and &lt;tense&gt; features are not extracted directly from SJTree. In this paper, we analyze 7 non-final endings (EP) and 77 final endings (EF) used in SJTree in order to extract &lt;mode&gt; and &lt;tense&gt; features automatically. In general, EF carries &lt;mode&gt; inflections and EP carries &lt;tense&gt; inflections. Conjunctive endings (EC) are not involved in &lt;mode&gt; and &lt;tense&gt; features, so for them we extract only &lt;ec&gt; features with their string values. &lt;ef&gt; and &lt;ep&gt; features are also extracted with their string values. Some non-final endings, such as si, are extracted as &lt;hor&gt; features, which carry honorific meaning. In the extracted FB-LTAG grammars, lexical heads are presented in a bare infinitive form, with morphological features such as &lt;ep&gt;, &lt;ef&gt; and &lt;ec&gt; that relate them to their inflected forms.</Paragraph> <Paragraph position="3"> &lt;det&gt; is another feature that can be extracted automatically from SJTree; unlike the other extracted features, it is extracted from both syntactic tags and morphological analyses. For example, &lt;det=-&gt; is extracted from dependent nouns, which always need modifiers (identified by morphological analyses), while &lt;det=+&gt; is extracted from _MOD phrases (identified by syntactic tags). &lt;det=+&gt; is also extracted from the syntactic tag DP, which contains MMs (determinative or demonstrative).</Paragraph> <Paragraph position="5"> The actual feature extraction procedure is implemented in two phases. In the first phase, we convert syntactic tags and morphological analyses into feature structures as explained above. In the second phase, we complete the feature structures on the nodes of the dorsal spine. For example, we put the same features as VV bottom onto VV top, VP top/bottom and S bottom, because the nodes of the dorsal spine share a certain number of the features of VV bottom. 
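The second phase can be illustrated with a small Python sketch. It is a simplification under stated assumptions rather than the procedure actually implemented: the function name percolate and the dictionary layout keyed by (node, position) pairs are hypothetical, and the sketch copies every feature value onto the spine, whereas the text only says that spine nodes share a certain number of features. The feature values for balpyoha.eoss.da are those given for Figure 5.

```python
# Hypothetical sketch of phase two: completing feature structures
# along the dorsal spine. Names and data layout are illustrative only.


def percolate(anchor_bottom, spine_slots):
    """Copy the anchor's bottom features onto each listed spine slot.

    anchor_bottom: dict of features found on the bottom of the VV anchor.
    spine_slots: list of (node_label, "top" or "bottom") pairs on the spine.
    """
    completed = {("VV", "bottom"): dict(anchor_bottom)}
    for slot in spine_slots:
        # Each slot gets its own copy so later edits stay local to one node.
        completed[slot] = dict(anchor_bottom)
    return completed


if __name__ == "__main__":
    # Bottom features of the anchor, as listed for Figure 5.
    anchor = {"ep": "eoss", "ef": "da", "mode": "decl", "tense": "past"}
    # Spine slots named in the text: VV top, VP top/bottom and S bottom.
    spine = [("VV", "top"), ("VP", "top"), ("VP", "bottom"), ("S", "bottom")]
    for slot, features in percolate(anchor, spine).items():
        print(slot, features)
```

A fuller treatment would share variables between nodes, as Figure 5 does with x, y, i and j, instead of copying values.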
The initial tree for the verb balpyoha.eoss.da is completed as shown in Figure 5 for an FB-LTAG (see Park (2006) for details).</Paragraph> <Paragraph position="6"> Korean does not need features such as &lt;person&gt; in English, or &lt;gender&gt; and &lt;number&gt; in French. Han et al. (2000) proposed several features for a Korean FB-LTAG which we do not use in this paper, such as &lt;adv-pp&gt;, &lt;top&gt; and &lt;aux-pp&gt; for nouns and &lt;clause-type&gt; for predicates. While postpositions are separated from eojeols during our grammar extraction procedure, Han et al. treated them as &quot;one&quot; inflectional morphology of the noun phrase eojeol. As we explain in Section 4, separating postpositions from eojeols is much more effective for the lexical coverage of extracted grammars. In Han et al., &lt;adv-pp&gt; simply contains the string value of adverbial postpositions. &lt;aux-pp&gt; adds the semantic meaning of auxiliary postpositions such as only, also, etc., which we cannot extract automatically from SJTree or other Korean treebank corpora because syntactically annotated treebanks generally do not contain such semantic information. &lt;top&gt; marks the presence or absence of a topic marker such as neun in Korean; however, topic markers are annotated like subjects in SJTree, which means that only &lt;case=subject&gt; is extracted for topic markers.</Paragraph> <Paragraph position="7"> &lt;clause-type&gt; indicates the type of the clause and takes values such as main, coord(inative), subordi(native), adnom(inal), nominal and aux-connect. Since the distinction between clause types other than the main clause is very vague in Korean, we do not adopt this feature. 
Instead, &lt;ef&gt; is extracted if the clause is a main clause and &lt;ec&gt; is extracted for the other types.</Paragraph> <Paragraph position="8"> [Figure 5: feature structures of the initial tree for balpyoha.eoss.da. Bottom (b) features: &lt;ep&gt; = eoss, &lt;ef&gt; = da, &lt;mode&gt; = decl, &lt;tense&gt; = past. The remaining top (t) and bottom (b) feature structures on the spine share the variables &lt;ep&gt; = x, &lt;ef&gt; = y, &lt;mode&gt; = i, &lt;tense&gt; = j.]</Paragraph> <Paragraph position="10"/> </Section> <Section position="5" start_page="75" end_page="75" type="sub_section"> <SectionTitle> 4.1 Extraction of lexicalized trees </SectionTitle> <Paragraph position="0"> In this paper, we extract lexicalized trees not only without modification of the Treebank, but also with modifications of the Treebank under some constraints, in order to improve the lexical coverage of the extracted grammars.</Paragraph> <Paragraph position="1"> : Separating symbols and postpositions from eojeols. Separated symbols are extracted and divided into α and β trees based on their types. Every separated postposition is an α tree. Complex postpositions consisting of two or more postpositions are extracted as a single α tree. Finally, NP β trees are converted into α trees and the syntactic tag is removed from NP α trees.</Paragraph> <Paragraph position="2"> Figures 6 and 7 show extracted lexicalized grammars. For extracting trees of symbols and of postpositions, we newly add the SYM and POSTP syntactic tags, which SJTree does not use. See Figure 11 for extracted symbol and postposition trees.</Paragraph> </Section> <Section position="6" start_page="75" end_page="75" type="sub_section"> <SectionTitle> 4.2 Extraction of feature-based lexicalized trees </SectionTitle> <Paragraph position="0"> We extract feature-based lexicalized trees using the reduced tagset because FB-LTAG grammars carry their syntactic information in feature structures. 
The extracted grammars G remove syntactic tags, ultimately use the reduced tagset, add the extracted feature structures and use infinitive forms as lexical anchors.</Paragraph> </Section> </Section> <Section position="5" start_page="75" end_page="75" type="metho"> <SectionTitle> * G </SectionTitle> <Paragraph position="0"> : Using the reduced tagset, using an infinitive as the lexical anchor, and adding the extracted feature structures.</Paragraph> <Paragraph position="1"> The G row in Table 1 below shows the results of the extraction procedures above. Figure 8 shows extracted feature-based lexicalized grammars G. To simplify the figure, we show only the feature structures that are necessary for understanding it.</Paragraph> <Section position="1" start_page="75" end_page="75" type="sub_section"> <SectionTitle> 4.3 Extraction of tree schemata </SectionTitle> <Paragraph position="0"> As mentioned in the Introduction, one of the most serious problems in automatic grammar extraction is its limited lexical coverage. To resolve this problem, we enlarge our extracted lexicalized grammars using templates, which we call tree schemata.</Paragraph> <Paragraph position="1"> To form a tree schema, the lexical anchor is removed from an extracted tree and replaced by an anchor mark (for example, @NNG where the lexical anchor in the extracted lexicalized grammar is a common noun). The number of tree schemata is much smaller than that of lexicalized grammars.</Paragraph> <Paragraph position="2"> Table 2 shows the number of template trees and the average frequency for each template grammar.</Paragraph> <Paragraph position="3"> have the same lexical coverage since they have the same lexical entries. The grammars extracted in this paper are evaluated by their size and their coverage. The size of a grammar is the number of tree schemata as a function of the number of sentences, as shown in Figure 9. 
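The construction of tree schemata and the size measure described above can be sketched as follows. This is a hedged illustration, not the code used for the experiments: it assumes that each lexicalized tree is written as a bracketed string whose anchor is the only leaf of the form (POS word), the function name to_schema is hypothetical, and the romanized anchors sagwa, hakgyo and meok are invented examples.

```python
# Hypothetical sketch: turning lexicalized trees into tree schemata
# by replacing the lexical anchor with an anchor mark such as @NNG.
import re
from collections import Counter

# Matches an innermost "(POS word)" leaf, e.g. "(NNG sagwa)".
ANCHOR = re.compile(r"\((\w+) [^()\s]+\)")


def to_schema(lexicalized_tree):
    """Replace the lexical anchor leaf by '@' followed by its POS tag."""
    return ANCHOR.sub(r"@\1", lexicalized_tree)


if __name__ == "__main__":
    # Invented example trees; only the anchors differ between the first two.
    trees = ["(NP (NNG sagwa))", "(NP (NNG hakgyo))", "(VP (VV meok))"]
    schemata = Counter(to_schema(tree) for tree in trees)
    # Number of distinct schemata, i.e. the grammar size in this sketch.
    print(len(schemata))
    # How often each schema occurs.
    print(schemata)
```

Because trees that differ only in their anchor collapse into one schema, the count of distinct schemata grows more slowly than the count of lexicalized trees.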
The coverage of a grammar is measured as the number of occurrences of unknown tree schemata in the corpus divided by the total number of occurrences of tree schemata, as shown in Table 3.</Paragraph> <Paragraph position="4"> set (2,273 sentences) and 10% of test set (253 sentences). We manually overlap our 163 tree schemata for predicates from T, which contain 14 subcategorization frames, with the 11 subcategorization frames of an FB-LTAG grammar proposed in Han et al.</Paragraph> <Paragraph position="5"> (2000) to evaluate the coverage of hand-crafted</Paragraph> </Section> </Section> </Paper>