<?xml version="1.0" standalone="yes"?>
<Paper uid="J91-4001">
  <Title>An Efficient Natural Language Processing System Specially Designed for the Chinese Language</Title>
  <Section position="3" start_page="0" end_page="351" type="metho">
    <SectionTitle>
2. The Head-Final/Head-Initial Structures of the Chinese Language
</SectionTitle>
    <Paragraph position="0"> The Chinese language has many special syntactic phenomena that differ substantially from those of Western languages. Discussions of these characteristics can be found in the literature (Chao 1968; Li and Thompson 1981; Huang 1982).</Paragraph>
    <Paragraph position="1"> Only those characteristics that significantly influence the present study are briefly described in this paper. They are (1) head-final/head-initial structures and (2) empty categories of the Chinese language, summarized in this section and the following section, respectively.</Paragraph>
    <Paragraph position="2"> The notion of the head of a phrase has a very long history: it stems from traditional grammar and plays a central role in recent syntactic analysis frameworks such as GB and GPSG (Sells 1985). The basic idea is simply that each phrase contains a certain word that is especially important in the sense that it determines many of the syntactic properties of the entire phrase; this word is called the head of the phrase.</Paragraph>
    <Paragraph position="3"> Most Chinese phrases and sentences are head-final; e.g., head nouns in NPs are always located at the final position. The NPs listed as examples 1, 2, and 3 in Figure 1 demonstrate this, where the underlines indicate the heads. A comparison of these Chinese phrases with their English counterparts (shown in parentheses below each Chinese phrase) shows that the positions of the heads in English are freer. On the other hand, Chinese phrases that are not head-final are found to be almost always head-initial, e.g., PPs (such as example 4 in Figure 1). This is somewhat different from Western languages like English. Figure 2 lists some fundamental phrase structure rules (PSRs) for the Chinese language used in the present system. The underlines indicate the head of each PSR. Some of the categories here are from Chao's classification (Chao 1968), and the rules are primarily based on the theory of Huang (Huang 1982). Apparently, the head in each of the rules is located either at the initial position (head-initial) or at the final position (head-final). Such head-final/head-initial structures will be especially useful in the present study, as will become clear later in this paper.</Paragraph>
    <Paragraph position="4"> Figure 1. Some examples of Chinese noun phrases and preposition phrases. Word-by-word glosses and translations: (1) playing (relativizer) children (the children who were playing); (2) I the live in America (relativizer) good friend (the good friend of mine who lives in America); (3) a (classifier) quite pretty girl (a quite pretty girl); (4) he from your friends borrow money (he borrowed money (from your friends)PP).
Figure 2. A list of some fundamental PSRs for the Chinese language used in the present study. Left-hand-side categories: (1) S = bar, (2) S-bar, (3) S, (4) NP, (5) XPDE, (6) VP, (7) V-bar, (8) PP.</Paragraph>
    <Paragraph position="5">  position that has been vacated by a transformation called "move α" (a transformational operation introduced in government binding theory (GB) that means "move something somewhere"; i.e., the NP has been moved to a different position such that an empty position is left). Such empty categories are called "traces." They indicate the empty positions left when movements occur. There is another kind of empty category that also consists of vacant NP positions but is not a trace, because it is not derived from "move α." These empty categories are called "null pronominals." Since the distance between the actual NP and its corresponding empty category may be long, and the grammatical relations in such sentences can be very complicated, it is usually difficult to represent such linguistic phenomena with simple rules. In other words, it is difficult to list all such possible movements and null pronominals exhaustively, and to specify all the relevant constraints explicitly in the grammar. Empty categories (or empty NPs) have thus become a convenient device commonly used in linguistic theories to explain these very complicated syntactic phenomena.</Paragraph>
    <Paragraph position="6"> In Mandarin Chinese, passivization, relativization, topicalization, ba-transformation, and the use of zero pronouns play major roles in sentence structure. The conventional way to deal with these syntactic phenomena is to collect a set of complicated grammar rules that covers all the possibilities. However, the high complexity, especially that resulting from interactions among several of these transformations, makes such an approach infeasible. A completely different approach is therefore adopted in this paper: a specially designed raise-bind mechanism based upon the theory of empty categories, as will become clear later in this paper. It will be shown that, with this raise-bind mechanism, the parser treats all these transformations in relatively simple ways. In the following, some examples of empty categories often encountered in the Chinese language are first discussed. Consider the Chinese sentences (1)-(8), of which only the following fragments are recoverable: he asked children go to dinner (He asked the children to go to dinner); 8. zero pronoun: ... (aspect marker). Sentences (2)-(8) all involve a missing subject or object (indicated by "e"). The solid lines under sentences (2)-(7) indicate the antecedent to which each missing subject or object refers. The missing object in sentence (8), however, does not refer to any element within the sentence. In fact, it is an omitted pronoun, which refers to someone or something understood from the situation.</Paragraph>
    <Paragraph position="7"> According to GB theory (Chomsky 1981; Huang 1982), sentence (2) is derived from sentence (1) by a transformation called "ba-transformation." The word "ba" is a patient case marker: it indicates that the NP following it is the patient of the main verb in the sentence. The transformation is performed as follows: the object "Jang-san" in (1) is moved by the carrier "ba" to the position indicated in (2), and a trace (indicated by "e") is left behind. The trace dominates no lexical material, but is "bound" to its antecedent, "Jang-san." This phenomenon appears very frequently in Chinese sentences. Similar situations occur in sentences (3)-(5). In sentence (3), the theory holds that the object "Jang-san" is moved back to the subject position and a trace is left behind, which transforms the sentence into a passive one. In sentence (4), the object "that dog" can be thought of as being moved to the sentence-initial position to form a topic. This is also seen very often in Chinese sentences and is called "topicalization." In sentence (5), one explanation is that the relative clause "the children were playing" in the sentence-initial position originally modifies the subject "children," but the first "children" is omitted because of the repetition. This is relativization. Sentences (2)-(5) all involve a movement and a trace. In the Chinese language, ba-transformation, passivization, topicalization, and relativization can all be analyzed using movements and traces. The basic idea is that these phenomena are very sophisticated syntactically, but ... syntax trees as long as the empty NP can be inserted into the right position and the movement understood, the analysis of these phenomena will be significantly simplified, as will be shown later in this paper.</Paragraph>
    <Paragraph position="8"> Figure 3. The block diagram of the system described in this paper.</Paragraph>
    <Paragraph position="9"> The empty categories in sentences (6)-(8) are null pronominals rather than traces, because they are not derived from "move α." The notation [...] in sentences (6) and (7) denotes the presence of a clause. Null pronominals are in general free, as in sentence (8), but in certain constructions they are bound, as in sentences (6) and (7). Sentence (7) is called a pivot construction but sentence (6) is not, because in sentence (7) the object of the first verb is also the subject of the second verb, while in sentence (6) it is the subject of the first verb that is actually the subject of the second verb. Therefore, in sentence (7) the null pronominal in the subject position is "bound" to the object of the first verb, but this is not the case in sentence (6). The special techniques of the raise-bind mechanism proposed in this paper to handle all these different types of empty categories will be explained in detail later in this paper.</Paragraph>
  </Section>
  <Section position="4" start_page="351" end_page="352" type="metho">
    <SectionTitle>
4. The Overall System and the Linguistic Knowledge Base
</SectionTitle>
    <Paragraph position="0"> Because the Chinese language has many special structures quite different from those of many other languages, a Chinese natural language processing system is specially designed in this paper to parse Chinese sentences more efficiently. The block diagram of the system is shown in Figure 3. The system is composed of two parts. The first part, consisting of a preprocessor and a lexicon plus word formation rules, segments the input Chinese sentences (i.e., series of Chinese characters) into words by looking them up in the lexicon and applying some word formation rules. Such segmentation is necessary because, in Chinese, a word can be composed of one to several characters, with no blanks on either side to mark its boundaries. Because it is impossible to collect all Chinese words in the lexicon, word formation rules are used to identify some compound words in the input sentences, e.g., the determiner/measure compound words, the reduplication words, etc., so that they do not have to be stored in the lexicon. However, because of the high degree of inherent lexical ambiguity, an input sentence can very often be segmented into several different possible word combinations, and there are no simple rules to decide which combination is correct. In this preprocessor, a heuristic longest word matching rule (Chen 1985) is applied to select the most promising word combination, but errors still occur sometimes and manual correction is then needed. The preprocessor also adds relevant categorial information and other features extracted from the lexicon to each of the words. The result of the first part is represented by a data structure, a direction-selective chart (to be discussed in detail in the next section), and is passed to the second part.
The second part, consisting of a parser and a linguistic knowledge base, builds up phrases on the direction-selective chart by applying the linguistic knowledge base. The parser is a head-driven chart parser, with several special approaches developed to make it more efficient for the Chinese language, as will also be made clear later in this paper. The linguistic knowledge base can be broadly seen as a four-tuple: the phrase structure rules (PSRs), the FIRST and LAST parsing tables of these rules, the check rules, and the lexicon shared with the first part. If the sentence is grammatical with respect to the grammar, a syntax tree is produced as the output; otherwise, failure is reported. From now on, this paper concentrates on the second part of the system only, i.e., the parser and the linguistic knowledge base; details of the first part can be found in other works (Ho 1984; Chen 1985). As far as the second part is concerned, the input sentences are assumed to be already segmented into words, with categorial information and other features provided by the lexicon. (Running head: Lin-Shan Lee et al., Processing System for Chinese Language.)</Paragraph>
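The heuristic longest word matching rule used by the preprocessor can be illustrated with a minimal Python sketch. It is a generic greedy forward maximum-matching segmenter, not the implementation of Chen (1985); the toy lexicon, the function name `segment_longest_match`, and the `max_len` parameter are assumptions made purely for illustration, and Latin letters stand in for Chinese characters.

```python
def segment_longest_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position take the longest lexicon word."""
    words = []
    i = 0
    while i != len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Unknown single characters are emitted as one-character words.
            if candidate in lexicon or size == 1:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon (assumption): letters stand in for Chinese characters.
toy_lexicon = {"ab", "abc", "cd", "d"}
print(segment_longest_match("abcd", toy_lexicon))
```

On this toy input the greedy rule yields `abc` followed by `d`, although `ab` followed by `cd` is an equally valid segmentation under the same lexicon. This mirrors the lexical ambiguity described in the text and suggests why a heuristic of this kind can still require manual correction.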
    <Paragraph position="1"> The linguistic knowledge base used in this system, as mentioned above, can be broadly seen as a four-tuple: the phrase structure rules (PSRs), the FIRST and LAST parsing tables for these PSRs, the check rules, and a lexicon, as shown in Figure 4. The PSRs describe how sentences are built up out of phrasal categories, and how phrases are built up out of lexical categories and/or phrasal categories. All of these PSRs, combined with some syntactic and semantic constraints, are implemented as an ATN-like network (Woods 1970). For each possible phrasal category (constituent), the FIRST and LAST parsing tables list all lexical categories that may begin or end that phrasal category; they guide the parser in eliminating unnecessary searching actions, as will be described in detail in Section 6. The check rules are used in the raise-bind mechanism to handle the binding problems of empty categories and to reject illegal sentences or parsing trees, as will be described in detail in Sections 8 and 9. The lexicon is a Chinese machine dictionary in which the allomorphs are stored together with their features and other information for syntactic and semantic analysis, e.g., category (CAT), arguments (ARG), meaning (MEA), allomorph (ALO), person, number, etc.</Paragraph>
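One way such FIRST and LAST tables could be derived from a set of PSRs is sketched below in Python. This is a standard fixed-point computation offered as an illustration, not the construction used in the paper (which is described in its Section 6). The toy grammar, the category names, and the function name `edge_table` are all assumptions, and the sketch assumes every rule has a non-empty right-hand side, so empty categories are ignored.

```python
def edge_table(rules, lexical, pick):
    """Fixed-point computation of the lexical categories at one end of each phrase.

    rules   : list of (lhs, rhs) pairs, rhs being a non-empty tuple of categories
    lexical : set of lexical (preterminal) categories
    pick    : 0 for the FIRST table, -1 for the LAST table
    """
    table = {lhs: set() for lhs, _ in rules}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            symbol = rhs[pick]
            # A lexical category contributes itself; a phrasal one contributes its own set.
            new = {symbol} if symbol in lexical else table.get(symbol, set())
            if not new.issubset(table[lhs]):
                table[lhs] |= new
                changed = True
    return table


# Hypothetical toy grammar (assumption), not the rules of Figure 2.
rules = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("NP", ("N",)),
    ("VP", ("PP", "V")),
    ("VP", ("V", "NP")),
    ("PP", ("P", "NP")),
]
lexical = {"Det", "N", "V", "P"}
first = edge_table(rules, lexical, 0)
last = edge_table(rules, lexical, -1)
print(sorted(first["VP"]), sorted(last["VP"]))
```

With tables of this kind, a parser can skip a rule whenever the word adjacent to the current edge has a lexical category that cannot begin (or end) the constituent being sought.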
  </Section>
  <Section position="5" start_page="352" end_page="354" type="metho">
    <SectionTitle>
5. The Direction-Selective Chart and the Head-Driven Chart Parser
</SectionTitle>
    <Paragraph position="0"> As discussed above, the Chinese language has prominent head-final/head-initial sentence structures, and zero pronouns are used relatively freely in Chinese sentences.</Paragraph>
    <Paragraph position="1"> Therefore, to reduce unnecessary computation in parsing Chinese sentences, a bottom-up, head-driven parsing strategy, as used in the present study, is more efficient than a top-down, strictly left-to-right one. This is because a bottom-up strategy avoids the duplicated computation that a top-down parser often suffers from when backtracking occurs, and a head-driven strategy eliminates many unnecessary searching actions that often occur in a strictly left-to-right scheme (searching actions triggered by head constituents are likely to be more promising). This will all become clearer later in this paper. Several approaches were further developed in the present parser to better realize this concept, so that significant improvement over some previous Chinese natural language processing systems (Yang 1987; Jiang 1985; H. H. Chen et al. 1988) can be observed. In the following sections, these approaches, including the direction-selective chart, the bidirectional look-ahead approach, the heuristic scheduling policy, and the raise-bind mechanism, will be described in detail. The direction-selective chart is described first, in this section.</Paragraph>
    <Paragraph position="2"> Before parsing is performed, any input word sequence must first be represented in the direction-selective chart. Just like a conventional chart (Kay 1980; Winograd 1983), the direction-selective chart is an efficient data structure that records what has been done so far in the course of parsing, so as to avoid duplicate computation. Its special feature is that the active edges (the incomplete constituents that need other complete constituents to their left or right to compose larger ones) are further partitioned into two disjoint groups, forward-active (F-active) and backward-active (B-active) edges, to indicate different search directions, as described below. In the head-driven parser, the parsing process begins at the heads in the input word sequence. As described in Section 2, the heads in the Chinese language are at either the initial or the final position of a phrase; therefore, the searching actions triggered by an initial head (being a complete constituent) always look forward (from left to right), while the actions triggered by a final head always look backward (from right to left). No bidirectional searching actions can be triggered by a single head in the course of parsing. Therefore, in this head-driven chart parser, the F-active edges denote forward searching actions and the B-active edges denote backward ones. The information specified on each active edge thus includes the search direction (forward or backward) in addition to the usual information, such as the vertices where the edge starts and ends, the grammar rule referred to, etc.</Paragraph>
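The partition of active edges into F-active and B-active groups can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the class name `Edge`, the function `extend`, the direction codes "F" and "B", and the two sample rules (PP -> P NP as a head-initial rule, NP -> Det N as a head-final rule) are hypothetical and are not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Edge:
    start: int                        # vertex where the edge starts
    end: int                          # vertex where the edge ends
    category: str                     # category being built (already built if inactive)
    needed: Tuple[str, ...] = ()      # categories still required; empty means inactive
    direction: Optional[str] = None   # "F" or "B" for active edges, None for inactive

    @property
    def active(self) -> bool:
        return bool(self.needed)


def extend(active: Edge, inactive: Edge) -> Optional[Edge]:
    """Combine an active edge with an adjacent inactive edge lying in its search direction."""
    if not active.active or inactive.active:
        return None
    if active.direction == "F":
        # Forward edges consume the next needed category on their right.
        if inactive.start == active.end and inactive.category == active.needed[0]:
            rest = active.needed[1:]
            return Edge(active.start, inactive.end, active.category, rest, "F" if rest else None)
    elif active.direction == "B":
        # Backward edges consume the last needed category on their left.
        if inactive.end == active.start and inactive.category == active.needed[-1]:
            rest = active.needed[:-1]
            return Edge(inactive.start, active.end, active.category, rest, "B" if rest else None)
    return None


# Hypothetical head-initial rule PP -> P NP: the head P at vertices 0-1 opens an F-active edge.
pp = extend(Edge(0, 1, "PP", ("NP",), "F"), Edge(1, 3, "NP"))
# Hypothetical head-final rule NP -> Det N: the head N at vertices 2-3 opens a B-active edge.
np_edge = extend(Edge(2, 3, "NP", ("Det",), "B"), Edge(1, 2, "Det"))
print(pp, np_edge)
```

Because each active edge carries exactly one direction, `extend` only ever inspects one neighbor of the edge, which is the simplification that the head-final/head-initial property makes possible.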
    <Paragraph position="3"> The two diagrams in Figure 5 show the two different searching actions. Figure 5a shows the forward search and Figure 5b the backward search; each arc represents an inactive edge (a complete constituent) and each arrow line represents an active edge.</Paragraph>
    <Paragraph position="4"> The labels attached above the inactive edges denote the corresponding categories.</Paragraph>
    <Paragraph position="5"> (a) The searching actions triggered by an initial head are always looking forward (left-to-right). The sample grammar rule: X -&gt; Y1 ... Yn</Paragraph>
    <Paragraph position="7"> (b) The searching actions triggered by a final head are always looking backward (right-to-left).</Paragraph>
    <Paragraph position="8"> The sample grammar rule: X -&gt; Y1 ... Yn
Figure 5. The searching actions in the direction-selective chart. According to the sample grammar rules listed in the figure, the arrow points out the search direction, and a label attached above with a form X//Y indicates that it needs a right neighboring complete constituent with Y category to form an X constituent; a label with a form X\\Y indicates that it needs a left neighboring complete constituent with Y category to form an X constituent.</Paragraph>
    <Paragraph position="9"> By way of comparison, in Stock's island-driven bidirectional chart (Stock et al. 1988), the searching actions are triggered by islands (an island is a more reliable word hypothesis resulting from speech recognition) and the search may be bidirectional; i.e., an active edge may search for constituents on both sides, as shown in Figure 6. Pareschi and Steedman (1987) also proposed a similar bidirectional chart parsing algorithm to handle operations such as functional composition in categorial grammar applications (Steedman 1985). In our parser, however, the actions triggered by the heads are directed either strictly forward or strictly backward, a direct consequence of the head-final/head-initial phenomena of the Chinese language. This makes the control of our parser much simpler and more efficient for the present problem.</Paragraph>
  </Section>
</Paper>