<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2126"> <Title>Word Order Acquisition from Corpora</Title> <Section position="2" start_page="0" end_page="871" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Although it is said that word order is free in Japanese, linguistic research shows that there are certain word order tendencies -- adverbs of time, for example, tend to precede subjects, and bunsetsus in a sentence that are modified by a long modifier tend to precede other bunsetsus in the sentence. Knowledge of these word order tendencies would be useful in analyzing and generating sentences.</Paragraph> <Section position="1" start_page="0" end_page="871" type="sub_section"> <SectionTitle> In this paper we define word order as the order of </SectionTitle> <Paragraph position="0"> modifiers, or the order of bunsetsus which depend on the same modifiee. There are several elements which contribute to deciding the word order, and they are summarized by Saeki (Saeki, 1998) as basic conditions that govern word order. When interpreting these conditions according to our definition, we can summarize them as follows.</Paragraph> <Paragraph position="1"> Componential conditions * A bunsetsu having a deep dependency tends to precede a bunsetsu having a shallow dependency. When there is a long distance between a modifier and its modifiee, the modifier is defined as a bunsetsu having a deep dependency. For example, the usual word order of modifiers in Japanese is the following: a bunsetsu which contains an interjection, a bunsetsu which contains an adverb of time, a bunsetsu which contains a subject, and a bunsetsu which contains an object. Here, the bunsetsu containing an adverb of time is defined as a bunsetsu having deeper dependency than the one containing a subject. 
We call the concept representing the distance between a modifier and its modifiee the depth of dependency.</Paragraph> <Paragraph position="2"> A bunsetsu having wide dependency tends to precede a bunsetsu having narrow dependency.</Paragraph> <Paragraph position="3"> A bunsetsu having wide dependency is defined as a bunsetsu which does not rigidly restrict its modifiee. For example, the bunsetsu &quot;Tokyo_e (to Tokyo)&quot; often depends on a bunsetsu which contains a verb of motion such as &quot;iku (go),&quot; while the bunsetsu &quot;watashi_ga (I)&quot; can depend on a bunsetsu which contains any kind of verb.</Paragraph> <Paragraph position="4"> Here, the bunsetsu &quot;watashi_ga (I)&quot; is defined as a bunsetsu having wider dependency than the bunsetsu &quot;Tokyo_e (to Tokyo).&quot; We call the concept of how rigidly a modifier restricts its modifiee the width of dependency.</Paragraph> <Paragraph position="5"> Syntactic conditions * A bunsetsu modified by a long modifier tends to precede a bunsetsu modified by a short modifier. A long modifier is a long clause, or a clause that contains many bunsetsus.</Paragraph> <Paragraph position="6"> * A bunsetsu containing a reference pronoun tends to precede other bunsetsus in the sentence.</Paragraph> <Paragraph position="7"> * A bunsetsu containing a repetition word tends to precede other bunsetsus in the sentence.</Paragraph> <Paragraph position="8"> A repetition word is a word referring to a word in a preceding sentence. For example, Taro and Hanako in the following text are repetition words. &quot;Taro and Hanako love each other. Taro is a civil servant and Hanako is a doctor.&quot; * A bunsetsu containing the case marker &quot;wa&quot; tends to precede other bunsetsus in the sentence.</Paragraph> <Paragraph position="9"> A number of studies have tried to discover the relationship between these conditions and word order in Japanese. 
Tokunaga and Tanaka proposed a model for estimating Japanese word order based on a dictionary. They focused on the width of dependency (Tokunaga and Tanaka, 1991). Under their model, however, word order is restricted to the order of case elements of verbs, and it has been pointed out that the model can deal with only the obligatory cases and cannot deal with contextual information (Saeki, 1998). An N-gram model for detecting word order has also been proposed by Maruyama (Maruyama, 1994), but under this model word order is defined as the order of morphemes in a sentence. The problem setting of Maruyama's study thus differed from ours, and the conditions listed above were not taken into account in that study. As for estimating word order in English, a statistical model has been proposed by Shaw and Hatzivassiloglou (Shaw and Hatzivassiloglou, 1999). Under their model, however, word order is restricted to the order of premodifiers, or modifiers depending on nouns, and the model does not simultaneously take into account the many elements that contribute to determining word order. It would thus be difficult to apply the model to estimating word order in Japanese, given the many conditions listed above.</Paragraph> <Paragraph position="10"> In this paper, we propose a method for acquiring from corpora the relationship between the conditions itemized above and word order in Japanese. The method uses a model which automatically discovers word order tendencies in Japanese by using various kinds of information in and around the target bunsetsus. This model shows us to what extent each piece of information contributes to deciding the word order, and which word order tends to be selected when several kinds of information conflict. The contribution rate of each piece of information in deciding word order is efficiently learned by a model within a maximum entropy (M.E.) framework. 
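As a rough illustration of this kind of model, the sketch below trains a binary log-linear (maximum entropy) classifier that decides, for an ordered pair of bunsetsus (A, B), whether A naturally precedes B. The feature names and toy data are hypothetical stand-ins for the conditions above, not the authors' actual feature set or training procedure; a minimal sketch only.

```python
import math

# Hypothetical binary features for an ordered bunsetsu pair (A, B),
# loosely modeled on the conditions above:
#   f0: A contains the topic/case marker "wa"
#   f1: A is modified by a longer clause than B
#   f2: A contains a repetition (anaphoric) word
# Label 1 means A-before-B is the natural order. Toy data only.
DATA = [
    ([1, 0, 0], 1),
    ([0, 1, 0], 1),
    ([0, 0, 1], 1),
    ([0, 0, 0], 0),   # reversed pairs: the "precede" cues sit on B
    ([1, 1, 0], 1),
    ([0, 0, 0], 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, dim, epochs=500, lr=0.5):
    """Fit a binary log-linear model by stochastic gradient ascent."""
    w = [0.0] * (dim + 1)              # last weight is the bias term
    for _ in range(epochs):
        for x, y in data:
            z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
            g = y - sigmoid(z)         # gradient of the log-likelihood
            for i in range(dim):
                w[i] += lr * g * x[i]
            w[-1] += lr * g
    return w

def prob_precedes(w, x):
    """P(the order A-before-B is the natural one, given features x)."""
    return sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x)))

weights = train(DATA, dim=3)
# Each learned weight shows how strongly its condition pushes a bunsetsu
# toward an earlier position, and their relative sizes decide the outcome
# when conditions conflict -- the "contribution rate" idea in the text.
```

In a real setting the features would be extracted from parsed corpus sentences, and each adjacent bunsetsu pair in its observed order would supply one positive training example (with the reversed pair as a negative one).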
The performance of the trained model can be evaluated according to how many instances of word order selected by the model agree with those in the original text. Because the word order of the text in the corpus is correct, the model can be trained using a raw corpus instead of a tagged corpus, provided it is first analyzed by a parser. In this paper, we show experimental results demonstrating that this is indeed possible even when the parser is only 90% accurate.</Paragraph> <Paragraph position="11"> This work is part of corpus-based text generation. Given the dependencies between bunsetsus, a whole sentence can be generated in the natural word order by using the trained model. This could be helpful for several applications, such as refinement support and text generation in machine translation.</Paragraph> </Section> </Section></Paper>