An Efficient Natural Language Processing 
System Specially Designed for the 
Chinese Language 
Lin-Shan Lee* 
National Taiwan University 
Long-Ji Lin* 
National Taiwan University 
K.-J. Chen¶
Academia Sinica 
Lee-Feng Chien* 
National Taiwan University 
James Huang§
Cornell University 
In this paper an efficient natural language processing system specially designed for the Chinese 
language is presented. The center of the present system is a bottom-up chart parser with 
head-driven operation; i.e., phrases are built up by starting with their heads and adjoining 
constituents to the left or right of the heads instead of strictly from left to right. In this way 
many more unnecessary searching actions can be effectively eliminated. The present system also 
includes several efficient approaches such as a direction-selective chart to simplify the control of 
the head-driven operation; a heuristic scheduling policy and a bidirectional look-ahead approach 
to eliminate many unnecessary searching actions, and an improved raise-bind mechanism 
combined with check rules to treat the difficult problems of movement transformations and 
empty categories and to simplify the design of grammar rules. The present design is based on 
careful consideration of some special syntactic phenomena of the Chinese language, such as 
head-final and head-initial structures and empty categories. A prototype of the present system 
has been successfully implemented and extensive experiments have been performed. The test 
results show a significant improvement in efficiency when processing many very complicated 
Chinese sentences. The various approaches, the overall system design, and the experimental 
results are all discussed in detail in this paper. 
1. Introduction 
The use of computers to process natural languages has been the research goal of many 
scientists and engineers for many years, and significant improvement in technologies 
in recent years has brought such a goal closer to reality. While substantial efforts have 
been made to process natural languages, especially several western languages such 
as English, and many powerful computational models and algorithms have been pro- 
* Dept. of Electrical Engineering and Dept. of Computer Science and Information Engineering, National Taiwan University 
† Dept. of Computer Science and Information Engineering, National Taiwan University 
‡ Dept. of Electrical Engineering, National Taiwan University 
§ Dept. of Modern Linguistics, Cornell University, NY 
¶ Institute of Information Science, Academia Sinica, Taipei, Taiwan 
© 1991 Association for Computational Linguistics 
Computational Linguistics Volume 17, Number 4 
posed and widely used (Gazdar et al. 1987), very little work has been done with 
the Chinese language, which more than a quarter of the world's population use as 
a native language. This is probably due to the fact that the structure of the Chinese 
language is quite different from western languages like English and, therefore, the 
experience in processing western languages cannot necessarily be directly applied to 
the Chinese language. Jiang (1985) proposed a preliminary Chinese parsing prototype 
system based on the METAL system, while Lin (1985) and Lin et al. (1986a, 1986b) also 
developed a Chinese natural language processing system with special considerations 
on the phenomenon of empty categories in the Chinese language. Yang (1987) pre- 
sented a method using semantic constraints to reduce ambiguity in Chinese sentence 
analysis. H. H. Chen et al. (1988) proposed a logic programming approach considering 
Chomsky's Government-Binding theory to cope with movement transformations in 
Mandarin Chinese. 
In the following, some special syntactic phenomena of the Chinese language that 
significantly affect the design of the present system are first summarized in Sections 
2 and 3, and a brief description of the present system and the structure of the lin- 
guistic knowledge base is then given in Section 4. The several new approaches, in- 
cluding the direction-selective chart and the head-driven chart parser, the bidirectional 
look-ahead approach, and the heuristic scheduling policy, are described in detail in 
Sections 5, 6, and 7, respectively. Sections 8 and 9 then present the improved design 
of the raise-bind mechanism to cope with the problem of movement transformation 
and empty categories. Some preliminary experimental results are discussed in Sec- 
tion 10, and concluding remarks and future research directions are finally given in 
Section 11. 
2. The Head-Final/Head-Initial Structures of the Chinese Language 
The Chinese language has many special syntactic phenomena substantially different 
from western languages. Discussions about such characteristics of the Chinese lan- 
guage can be found in the literature (Chao 1968; Li and Thompson 1981; Huang 1982). 
In this paper only some of them that have significant influence on the present study 
will be briefly described. They are (1) head-final/head-initial structures and (2) empty 
categories of the Chinese language, to be respectively summarized in this and the 
following sections. 
The notion of the head of a phrase has a very long history; it stems from 
traditional grammar and plays a central role in recent syntactic analysis frameworks 
such as GB and GPSG (Sells 1985). The basic idea is simply that each phrase con- 
tains a certain word that is especially important in the sense that it determines many 
of the syntactic properties of the entire phrase; this word is called the head of the 
phrase. 
Most Chinese phrases and sentences are head-final, e.g., head nouns in NPs are 
always located at the final position. For instance, some NPs (examples 1, 2, and 3) 
listed in Figure 1 demonstrate this situation, where the underlines indicate the heads. 
In the corresponding English phrases (shown in parentheses below each Chinese 
phrase), by contrast, the position of the head is freer. On the other hand, Chinese 
phrases that are not head-final are almost always head-initial, e.g., PPs (such as 
example 4 in Figure 1). This is somewhat different from western languages like 
English. Figure 2 is a list of some 
fundamental phrase structure rules (PSRs) for the Chinese language used in the present 
system. The underlines indicate the head of each PSR. Some of the categories here are 
from Chao's classification (Chao 1968), and the rules here are primarily based on the 
Lin-Shan Lee et al. Processing System for Chinese Language 
1. playing (relativizer) children 
   (the children who were playing) 
2. I the live in America (relativizer) good friend 
   (the good friend of mine who lives in America) 
3. a (classifier) quite pretty girl 
   (a quite pretty girl) 
4. he from your friends borrow money 
   (he borrowed money (from your friends)PP) 
Figure 1 
Some examples of Chinese noun phrases and preposition phrases. 
(1) S=bar  --> S-bar PRTAQ | S PRTAG 
(2) S-bar  --> Topic S 
(3) S      --> (NP) VP 
(4) NP     --> (XPDE) (QP) (ADJ) N | 
               (QP) (XPDE) (ADJ) N | 
               NP LOC 
(5) XPDE   --> S DE | NP DE | PP DE 
(6) VP     --> (AUX | ADV | PP | NP)* V-bar 
(7) V-bar  --> V QP | V (NP) (NP | PP | VP | S | S-bar) 
(8) PP     --> PREP NP 
Notations: 
Operators: |: Or Operator, *: Repetition Operator, (): Optional Operator, underline: head 
Phrasal Categories: S=bar, S-bar, S, NP, XPDE: an Associative Phrase or a Relative/Appositive Clause, 
VP, V-bar, PP, Topic, QP: Classifier and Measure Phrase. 
Lexical Categories: PRTAG: Particle Tag, N, ADJ, ADV, AUX, LOC: Localizer, DE: Relativizer 
Figure 2 
A list of some fundamental PSRs for the Chinese language used in the present study. 
theory of Huang (Huang 1982). Apparently, the head in each of the rules is located 
either at the initial position (head-initial) or at the final position (head-final). Such 
head-final/head-initial structures will be especially useful and helpful in the present 
study, as will be clear later in this paper. 
3. The Empty Categories of the Chinese Language 
In many languages the "empty category" is typically used to refer to an empty NP 
position that has been vacated by a transformation called "move α" (a transformational 
operation introduced in government-binding (GB) theory that means "move something 
somewhere"; i.e., the NP has been moved to a different position such that an empty 
position is left). Such empty categories are called "traces." They indicate the empty 
positions left when movements occur. There is another kind of empty category that also 
consists of vacant NP positions, but these are not traces, because they are not derived from 
"move α." These empty categories are called "null pronominals." Since the distance 
between the location of the actual NP and its corresponding empty category may be 
long and the grammatical relations in such sentences can be very complicated, it is usually 
difficult to represent such linguistic phenomena in simple rules. In other words, it is 
difficult to list all such possible movements as well as null pronominals exhaustively, 
and to specify all the relevant constraints explicitly in the grammar. Empty categories 
(or empty NPs) thus become a convenient device often used in linguistic theories 
to explain these very complicated syntactic phenomena. 
In Mandarin Chinese, passivization, relativization, topicalization, ba-transformation, 
and the use of zero pronouns play major roles in sentence structures. To deal 
with these syntactic phenomena, the conventional approach is to collect a set of 
complicated grammar rules to cover all the possibilities. However, the high complexity, 
especially that resulting from the interactions among several of these transformations, makes 
such an approach infeasible. A completely different approach is, therefore, adopted in 
this paper, in which a specially designed raise-bind mechanism based upon the 
theory of empty categories is used, as will be clear later in this paper. With such a raise-bind 
mechanism, it will be shown that the parser can treat all these transformations in 
relatively simple ways. In the following, some examples of empty categories often 
encountered in the Chinese language are first discussed. Consider the Chinese sentences 
(1)-(8) listed in the following. 
1. he hurt (aspect marker) Jang-san¹ 
   (He hurt Jang-san) 
2. ba-transformation: 
   he ba Jang-san hurt (aspect marker) e 
   (He hurt Jang-san) 
3. passivization: 
   Jang-san by him hurt (aspect marker) e 
   (Jang-san was hurt by him) 
¹ The transliteration scheme used here is based on the Mandarin Phonetic Symbols II published in Taipei, Taiwan by the Ministry of Education of the Republic of China. 
4. topicalization: 
   that dog I never have seen e 
   (I have never seen that dog) 
5. relativization: 
   e playing (relativizer) children go (aspect marker) 
   (the children who were playing are gone) 
6. null pronominals: 
   Jang-san tried [e escape] 
   (Jang-san tried to escape) 
7. pivot construction: 
   he asked children [e go to dinner] 
   (He asked the children to go to dinner) 
8. zero pronoun: 
   Jang-san likes e 
   (Jang-san likes someone or something) 
Sentences (2)-(8) all involve a missing subject or object (indicated by "e"). The 
solid lines under sentences (2)-(7) indicate the antecedent to which each missing subject 
or object refers. The missing object in sentence (8), however, does not refer to any 
element within the sentence. In fact, it is an omitted pronoun, which refers to someone 
or something understood in the situation. 
According to GB theory (Chomsky 1981; Huang 1982), sentence (2) is derived from 
sentence (1) by a transformation called "ba-transformation." The word "ba" is a 
patient case marker. It indicates that the NP following it is the patient of the main verb 
in the sentence. The transformation is performed as follows: the object, "Jang-san" 
in (1), is moved by the carrier "ba" to the position indicated in (2), and 
a trace (indicated by "e") is left behind. The trace dominates no lexical material, but 
is "bound" to its antecedent, "Jang-san." This phenomenon appears very frequently 
in Chinese sentences. Similar situations occur in sentences (3)-(5). In sentence 
(3), it is believed in the theory that the object "Jang-san" is moved back to the 
subject position and a trace is left behind to transform the sentence into a passive one. 
In sentence (4), the object "that dog" can be thought of as being moved to 
the sentence-initial position to form a topic. This is also very often seen in Chinese 
sentences, and is called "topicalization." In sentence (5), one explanation is that originally 
the relative clause "the children were playing" in the sentence-initial position 
is used to modify the subject "children," but the first "children" 
is omitted due to repetition. This is relativization. All these sentences (2)-(5) involve 
a movement and a trace. In the Chinese language, ba-transformation, passivization, 
topicalization, and relativization can all be analyzed using movements and 
traces. The basic idea is that these phenomena are very sophisticated syntactically, but 
[Block diagram: Chinese sentences -> Part I: preprocessor -> direction-selective charts -> Part II: the head-driven chart parser with several efficient approaches -> syntax trees; Linguistic Knowledge-Base] 
Figure 3 
The block diagram of the system described in this paper. 
as long as the empty NP can be inserted into the right position and the movement 
understood, the analysis of these phenomena will be significantly simplified, as will 
be shown later in this paper. 
Sentences (6)-(8) contain null pronominals rather than traces, because their empty 
categories are not derived from "move α." The notation [...] in sentences (6) and (7) 
denotes the presence of a clause. Null pronominals are in general free, as in sentence (8). 
But in certain constructions null pronominals are also bound, as in sentences (6) 
and (7). Sentence (7) is called a pivot construction but sentence (6) is not; this is 
because in sentence (7) the object of the first verb is also the subject of the second verb, 
while in sentence (6) it is the subject of the first verb that is actually the subject of 
the second verb. Therefore in sentence (7) the null pronominal in the subject position 
is "bound" to the object of the first verb, whereas in sentence (6) it is bound to the 
subject of the first verb. The 
special techniques of the raise-bind mechanism proposed in this paper to handle all 
such different types of empty categories will be explained in detail later in this paper. 
4. The Overall System and the Linguistic Knowledge Base 
Because the Chinese language has many special structures quite different from many 
other languages, in this paper a Chinese natural language processing system is spe- 
cially designed to parse Chinese sentences more efficiently. The block diagram of the 
system is shown in Figure 3. The system is composed of two parts. The first part, 
consisting of a preprocessor and a lexicon plus word formation rules, first segments 
the input Chinese sentences (or a series of Chinese characters) into words by looking 
them up in the lexicon and applying some word formation rules. This is because, in 
Chinese, a word can be composed of from one to several characters without blanks 
on both ends to indicate the boundaries of a word; therefore, such a segmentation is 
necessary. Because it is impossible to collect all Chinese words into the lexicon, 
word formation rules are used to identify some compound words in the input 
sentences, e.g., determiner/measure compounds and reduplicated words, 
so that they do not have to be stored in the 
lexicon. However, because of the high degree of inherent lexical ambiguity, very often 
an input sentence can be segmented into several different possible word combinations 
and there are no simple rules to decide which combination is the correct answer. In 
this preprocessor, a heuristic longest word matching rule (Chen 1985) is applied to 
decide a most promising word combination, but errors still happen sometimes in the 
preprocessor and manual correction is actually needed. The preprocessor also adds 
relevant categorial information and other features extracted from the lexicon to each 
of the words. The result of the first part is represented by a data structure, a direction-selective 
chart (to be discussed in detail in the next section), and is passed to the 
second part. The second part, consisting of a parser and a linguistic knowledge base, 
builds up phrases on the direction-selective chart by applying the linguistic knowledge 
base. The parser is a head-driven chart parser, but with several special approaches de- 
veloped to make the parser more efficient for the Chinese language, which will also 
be made clear later in this paper. The linguistic knowledge base can be broadly seen 
as a compilation of a four-tuple; i.e., the phrase structure rules (PSR), the FIRST and 
LAST parsing tables of these rules, the check rules, and the lexicon shared with the 
first part. If the sentence is grammatical in the sense of the grammar, a syntax tree 
will result as the output. Otherwise, failure will be reported. From now on, this paper 
will concentrate on the second part of the system, i.e., the parser and the linguistic 
knowledge base only, while the details of the first part can be found in other works 
(Ho 1984; Chen 1985). As far as the second part of the system is concerned, the input 
sentences are assumed to be segmented into words with categorial information and 
other features provided by the lexicon. 
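The longest word matching heuristic mentioned above can be illustrated with a minimal Python sketch. It is a generic greedy left-to-right longest-match segmenter, not necessarily the exact rule of Chen (1985); the function name, the `max_word_len` parameter, and the toy lexicon (Latin letters standing in for Chinese characters) are assumptions made only for illustration.

```python
def segment_longest_match(text, lexicon, max_word_len=4):
    """Greedy left-to-right longest-match segmentation (illustrative only)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon; each letter stands in for one Chinese character.
toy_lexicon = {"AB", "ABC", "CD", "D"}
print(segment_longest_match("ABCD", toy_lexicon))
```

With this toy lexicon the string could also be segmented as "AB" + "CD", which the greedy rule never considers. This is the kind of segmentation ambiguity for which, as noted above, manual correction is sometimes still needed.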
The linguistic knowledge base used in this system, as mentioned above, can be 
broadly seen as a compilation of a four-tuple: the phrase structure grammar (PSRs), 
the FIRST and LAST parsing tables for these PSRs, the check rules, and a lexicon 
as shown in Figure 4. The PSRs describe how sentences are built up out of phrasal 
categories, and how phrases are built up out of lexical categories and/or phrasal 
categories. All of these PSRs combined with some syntactic and semantic constraints 
are implemented as an ATN-like network (Woods 1970). For each possible phrasal 
category (constituent), the FIRST and LAST parsing tables list all the lexical 
categories with which that phrasal category may begin or end; they guide the parser 
in eliminating some unnecessary searching actions in parsing, as will be described in 
detail in Section 6. The check rules are used in the raise-bind mechanism to handle 
the binding problems of empty categories and to reject illegal sentences or parsing 
trees, as will be described in detail in Sections 8 and 9. The lexicon is a Chinese 
machine dictionary, in which the allomorphs are stored together with their features and 
other information for syntactic and semantic analysis; e.g., category (CAT), arguments 
(ARG), meaning (MEA), allomorph (ALO), person, number, etc. 
5. The Direction-Selective Chart and the Head-Driven Chart Parser 
As discussed above, the Chinese language has prominent head-final/head-initial sen- 
tence structures, and zero pronouns are relatively freely used in Chinese sentences. 
Therefore, to reduce unnecessary computation in parsing Chinese sentences, a bottom- 
up and head-driven parsing strategy, as was used in the present study, will be more 
efficient than a top-down and strictly left-to-right parsing strategy. This is because a 
bottom-up parsing strategy can avoid inefficiency in duplicating many computations 
that a top-down parser often suffers from when backtracking occurs, and a head-driven 
parsing strategy can eliminate many unnecessary searching actions (i.e., searching ac- 
tions fired by head constituents could be more promising) that often occur in a strictly 
[Diagram: the linguistic knowledge base comprises the phrase structure rules, the FIRST and LAST tables, the check rules, and the lexicon.] 
Figure 4 
The linguistic knowledge base. 
left-to-right parsing scheme. This will all become clearer later in this paper. Several 
approaches were further developed in the present parser described in this paper to 
better realize this concept, so that significant improvement as compared to some pre- 
vious Chinese natural language processing systems (Yang 1987; Jiang 1985; H. H. Chen 
et al. 1988) can be observed. In the following sections, these approaches, including the 
direction-selective chart, the bidirectional look-ahead approach, the heuristic schedul- 
ing policy, and the raise-bind mechanism will be described in detail. Here, we first 
describe the direction-selective chart in this section. 
Before parsing is performed, any input word sequence has to be first represented 
by the direction-selective chart. Just like a conventional chart (Kay 1980; Winograd 
1983), the direction-selective chart is an efficient data structure to record what has been 
done so far in the course of parsing to avoid duplicate computation. The special feature 
of the direction-selective chart is that the active edges (the incomplete constituents 
that need other complete constituents to their left or right to compose larger ones) are 
further partitioned into two disjoint groups: forward-active (F-active) and backward- 
active (B-active) edges to indicate different search directions as described below. 
In the head-driven parser, the parsing process will begin on the heads in the input 
word sequence. As described in Section 2, the heads in the Chinese language are at 
either the initial or final position of a phrase; therefore, in a head-driven parser, the 
searching actions triggered by an initial head (being a complete constituent) are al- 
ways looking forward (from left to right); while the actions triggered by a final head 
are always looking backward (from right to left). However, no bidirectional search- 
ing actions can be triggered by a single head in the course of parsing. Therefore, in 
this head-driven chart parser, the F-active edges are used to denote forward searching 
actions, and the B-active edges are used to denote backward ones. The information 
specified on each active edge then includes the search direction (forward or backward), 
in addition to the usual information, such as the vertices where the edge starts 
and ends, the grammar rule referred to, etc. 
Two diagrams depicted in Figure 5 show the two different searching actions. Figure 
5a is the forward search and Figure 5b the backward, in which each arc represents an 
inactive edge (a complete constituent) and each arrow line represents an active edge. 
The labels attached above the inactive edges denote the corresponding categories. 
(a) The searching actions triggered by an initial head are always looking forward (left-to-right). 
    The sample grammar rule: X -> Y1 ... Yn 
    Active edge label: X//Y3 
(b) The searching actions triggered by a final head are always looking backward (right-to-left). 
    The sample grammar rule: X -> Y1 ... Yn 
    Active edge label: X\\Yn-2 
Figure 5 
The searching actions in the direction-selective chart. 
According to the sample grammar rules listed in the figure, the arrow points out the 
search direction, and a label attached above with a form X//Y indicates that it needs 
a right neighboring complete constituent with Y category to form an X constituent; a 
label with a form X\\Y indicates that it needs a left neighboring complete constituent 
with Y category to form an X constituent. 
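The edge information described in this section might be represented as in the following Python sketch. The class and field names (`Edge`, `Direction`, `needed`, `search_vertex`) are hypothetical and are not taken from the implemented system; only the F-active/B-active distinction and the X//Y and X\\Y labels come from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Direction(Enum):
    FORWARD = "F"   # F-active: needs a constituent to the right
    BACKWARD = "B"  # B-active: needs a constituent to the left


@dataclass(frozen=True)
class Edge:
    start: int                              # vertex where the edge starts
    end: int                                # vertex where the edge ends
    category: str                           # category being built (X)
    rule: str                               # grammar rule referred to
    needed: Optional[str] = None            # category still needed (Y); None if inactive
    direction: Optional[Direction] = None   # search direction; None if inactive

    @property
    def is_active(self) -> bool:
        return self.needed is not None

    def label(self) -> str:
        """Return X for an inactive edge, X//Y or X\\\\Y for an active one."""
        if not self.is_active:
            return self.category
        sep = "//" if self.direction is Direction.FORWARD else "\\\\"
        return f"{self.category}{sep}{self.needed}"

    def search_vertex(self) -> Optional[int]:
        """Vertex where the needed constituent must start (F) or end (B)."""
        if not self.is_active:
            return None
        return self.end if self.direction is Direction.FORWARD else self.start


forward = Edge(1, 2, "V-bar", "V-bar --> V-n NP", "NP", Direction.FORWARD)
backward = Edge(2, 3, "NP", "NP --> (XPDE) N", "XPDE", Direction.BACKWARD)
print(forward.label(), backward.label())
```

Because every active edge carries exactly one direction, the parser never has to track a pending search on both sides of the same edge.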
For comparison, in Stock's island-driven bidirectional chart 
(Stock et al. 1988), the searching actions are triggered by islands (an island is a more 
reliable word hypothesis resulting from speech recognition) and the search may be 
bidirectional; i.e., an active edge may search for constituents on both 
sides, as shown in Figure 6. Pareschi and Steedman (1987) also proposed a 
similar bidirectional chart parsing algorithm to handle operations such as functional 
composition in categorial grammar applications (Steedman 1985). In our 
parser, however, the actions triggered by the heads are directed either strictly forward or 
strictly backward, a consequence of the head-final/head-initial phenomena of 
the Chinese language. This makes the control of our parser much simpler and more 
efficient for the present problem. 
6. The Bidirectional Look-Ahead Approach 
Chinese is believed to be a syntactically ambiguous language with relatively free 
word order, and the many complicated syntactic phenomena that result 
make it difficult for a parser to work on Chinese sentences deterministically, as 
was done in Marcus's famous work for English (1982). For example, in long-distance 
movements the distance between the location of the binding NP and its corresponding 
empty category may be long, and the grammatical relation in such a sentence can be 
The sample grammar rule: X -> Y1 ... Yi ... Yj ... Yn 
Figure 6 
The searching action with Stock's bidirectional chart parser. 
very complicated. It is also very often difficult for a parser to deal efficiently with the 
binding of the empty category deterministically. This is why substantial redundant 
computation efforts usually occur unavoidably in analyzing Chinese natural language. 
However, some of this redundant computation can be avoided in the present parser, 
by the approach discussed below. 
An active edge built on a chart indicates a stage in the search for a constituent. It 
records the category of the constituent it is looking for, where the constituent should 
be, and the structure obtained so far in order to form a complete one. During the 
course of parsing, because the parser usually cannot correctly predict the category and 
position of the constituents to be built next, many unnecessary and redundant active 
edges will inevitably be built and substantial searching effort is thus wasted, 
as very often happens in chart-based parsers. This is very inefficient and can, 
in fact, be significantly improved in the present system based on the following concept. 
Because no phrasal category is null in the grammar rules, whenever an active edge 
is built into the direction-selective chart, the parser can first examine the constituent, 
exactly located at the position the active edge is looking for, to check whether the 
desired category can begin with (if the active edge is F-active for an initial head) or 
end with (if the active edge is B-active for a final head) the category of the examined 
constituent, according to the description of the grammar. In this way it is possible for 
the parser of the present system to avoid building many unnecessary active edges by 
such a "bidirectional look-ahead approach" combining the special head-driven strategy 
developed in the present study with the concept of FIRST and LAST parsing tables 
(Aho and Ullman 1972) to be discussed in detail below. In the following, we shall first 
define these two tables and then describe how the bidirectional look-ahead approach 
works. 
FIRST(C): If C is a category, FIRST(C) is the set of all possible lexical categories 
the category C can begin with. Meanwhile, a matrix tabulating such FIRST relations 
of all categories of a grammar is called the FIRST parsing table of the grammar. For 
example, Figure 7 is the FIRST parsing table for the sample grammar rules listed in 
Figure 8. For instance, in it FIRST(NP) = {PRON, N} because FIRST(NP) = {PRON} ∪ 
{N} ∪ FIRST(XPDE) = {PRON} ∪ {N} ∪ FIRST(S) = {PRON} ∪ {N} = {PRON, N} 
(FIRST(XPDE) and FIRST(S) both lead back to FIRST(NP) and so add no new categories). 
LAST(C): If C is a category, LAST(C) is the set of all possible lexical categories the 
category C can end with. Meanwhile, a matrix tabulating such LAST relations of all 
categories of a grammar is called the LAST parsing table of the grammar. For example, 
Figure 9 shows the LAST parsing table for the sample grammar rules listed in Figure 8. 
         V-   V-n  PRON  N    DE   ADV
NP                 X     X
XPDE               X     X
VP       X    X                    X
V-bar    X    X
S                  X     X
V-       X
V-n           X
PRON               X
N                        X
DE                            X
ADV                                X
Figure 7 
The FIRST parsing table for the sample grammar shown in Figure 8, where each entry filled 
by an "X" indicates that the category (constituent) for the row may begin with the lexical 
category for the column. 
(1) NP    --> PRON | (XPDE) N 
(2) XPDE  --> S DE | NP DE 
(3) VP    --> (ADV)* V-bar 
(4) V-bar --> V- | V-n NP 
(5) S     --> NP VP 
Figure 8 
A set of sample grammar rules to show the construction of the FIRST and LAST parsing tables. 
         V-   V-n  PRON  N    DE   ADV
NP                 X     X
XPDE                          X
VP       X         X     X
V-bar    X         X     X
S        X         X     X
V-       X
V-n           X
PRON               X
N                        X
DE                            X
ADV                                X
Figure 9 
The LAST parsing table for the sample grammar shown in Figure 8, where each entry filled by 
an "X" indicates that the category (constituent) for the row may end with the lexical category 
for the column. 
For instance, in it LAST(VP) = LAST(V-bar) = {V-} ∪ LAST(NP) = {V-} ∪ {PRON} ∪ {N} = 
{V-, PRON, N}. 
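The FIRST and LAST sets defined above can be computed by a simple fixed-point iteration, sketched below in Python for the sample grammar of Figure 8. The encoding of rules as lists of (symbol, optional) pairs and the function name `edge_sets` are assumptions made for illustration. The sketch relies on the stated property that no phrasal category is null, and it reads rule (4) as V-bar --> V- | V-n NP, in line with the LAST(V-bar) derivation above.

```python
# Each alternative is a list of (symbol, is_optional) pairs.
# Optional covers both "(X)" and "(X)*" in the rule notation.
GRAMMAR = {
    "NP":    [[("PRON", False)], [("XPDE", True), ("N", False)]],
    "XPDE":  [[("S", False), ("DE", False)], [("NP", False), ("DE", False)]],
    "VP":    [[("ADV", True), ("V-bar", False)]],
    "V-bar": [[("V-", False)], [("V-n", False), ("NP", False)]],
    "S":     [[("NP", False), ("VP", False)]],
}


def edge_sets(grammar, from_left=True):
    """Compute FIRST (from_left=True) or LAST (from_left=False) sets."""
    sets = {cat: set() for cat in grammar}
    changed = True
    while changed:                      # iterate until nothing new is added
        changed = False
        for cat, alternatives in grammar.items():
            for alt in alternatives:
                symbols = alt if from_left else list(reversed(alt))
                for symbol, optional in symbols:
                    # A lexical category contributes itself; a phrasal
                    # category contributes whatever its set holds so far.
                    new = sets[symbol] if symbol in grammar else {symbol}
                    if not new <= sets[cat]:
                        sets[cat] |= new
                        changed = True
                    if not optional:
                        break           # a required symbol hides the rest
    return sets


FIRST = edge_sets(GRAMMAR, from_left=True)
LAST = edge_sets(GRAMMAR, from_left=False)
print(sorted(FIRST["NP"]), sorted(LAST["VP"]))
```

Only the phrasal rows of the tables are computed here; for a lexical category C, both FIRST(C) and LAST(C) are simply {C}, which gives the diagonal entries in Figures 7 and 9.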
In the present parser, both of these parsing tables for the Chinese grammar used 
have been constructed (no phrasal category is null). During parsing, when an F-active 
edge is waiting to be constructed (using X//Y as in Figure 5a to indicate its searching 
action), the parser will first look up the FIRST table to see whether the word, at the 
position it is looking for, has a lexical category belonging to FIRST(Y). If it does, then 
the active edge can be built; otherwise the active edge is redundant, because no such 
required constituent can be constructed. Similarly, when a B-active edge is waiting to 
(a) F-active edge V-bar//NP: N belongs to FIRST(NP) 
    ADV    V-n   N 
    again  hit   children 
    (hit children again) 
(b) B-active edge NP\\XPDE: V-n does not belong to LAST(XPDE) 
    ADV    V-n   N 
    again  hit   children 
Figure 10 
An example to illustrate the use of FIRST and LAST parsing tables to avoid building many 
unnecessary active edges on the direction-selective chart. 
be constructed (using X\\Y to indicate its searching action), the parser will first look 
up the LAST table to see whether the word, at the position it is looking for, has a 
lexical category belonging to LAST(Y). If it does, then the active edge can be built; 
otherwise the active edge is redundant, because no such required constituent can be 
constructed. 
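The look-ahead test just described amounts to a single table lookup before an active edge is built. The Python sketch below hard-codes the phrasal rows of the tables in Figures 7 and 9; the function name `worth_building`, the "F"/"B" direction codes, and the idea of passing a list of candidate lexical categories for the neighboring word are assumptions for illustration only.

```python
# Phrasal rows of the FIRST and LAST tables in Figures 7 and 9.
FIRST = {
    "NP": {"PRON", "N"}, "XPDE": {"PRON", "N"}, "VP": {"V-", "V-n", "ADV"},
    "V-bar": {"V-", "V-n"}, "S": {"PRON", "N"},
}
LAST = {
    "NP": {"PRON", "N"}, "XPDE": {"DE"}, "VP": {"V-", "PRON", "N"},
    "V-bar": {"V-", "PRON", "N"}, "S": {"V-", "PRON", "N"},
}


def worth_building(direction, needed, neighbor_categories):
    """Return True if an active edge needing `needed` should be built.

    direction: "F" for an F-active edge (check FIRST against the word to
    the right), "B" for a B-active edge (check LAST against the word to
    the left). neighbor_categories: lexical categories of that word.
    """
    table = FIRST if direction == "F" else LAST
    # A lexical category can only begin or end with itself.
    allowed = table.get(needed, {needed})
    return any(cat in allowed for cat in neighbor_categories)


# V-bar//NP with the right neighbor "children" (N): edge is built.
print(worth_building("F", "NP", ["N"]))
# NP\\XPDE with the left neighbor "hit" (V-n): edge is redundant.
print(worth_building("B", "XPDE", ["V-n"]))
```

The check costs one set membership test per candidate edge, which is small compared with the searching actions that a redundant active edge would otherwise trigger.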
For example, consider parsing the Chinese phrase illustrated in Figure 10 with the
sample grammar and parsing tables in Figures 7-9. In Figure 10a, an F-active edge
(V-bar//NP) is triggered by the initial head "hit." This indicates that a right
neighboring NP constituent is needed to form a complete V-bar constituent. The right
neighboring word, "children," has an N category, which belongs to FIRST(NP); therefore,
the edge can be constructed. In Figure 10b, on the other hand, a B-active edge NP\\XPDE
is triggered by the final head "children." This indicates that a left neighboring XPDE
constituent is needed to form a complete NP constituent. In this case, however, the left
neighboring word "hit" does not have a category belonging to LAST(XPDE); therefore, the
edge is redundant and will not be built. In this way, the bidirectional look-ahead
approach eliminates many unnecessary searching actions (or active edges) and makes the
parsing process more efficient. A similar approach can be found in Tomita's extended LR
parser (Tomita 1986), which uses a single extended LR parsing table and parses strictly
left to right, whereas the present system uses two parsing tables and two parsing
directions.
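The look-ahead test described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' C implementation: the function name and the table contents are assumptions, except for the two facts taken from the example (N belongs to FIRST(NP), and V-n does not belong to LAST(XPDE)).

```python
# Hypothetical excerpts of the FIRST and LAST tables. Only "N in FIRST(NP)" and
# "V-n not in LAST(XPDE)" come from the worked example; other entries are assumed.
FIRST = {"NP": {"N", "DET", "ADJ"}}
LAST = {"XPDE": {"DE"}}


def can_build_active_edge(direction, needed, categories, position):
    """Return True if an active edge passes the look-ahead test.

    direction  -- "F" for a forward edge (looks right), "B" for a backward edge
    needed     -- the category the edge is waiting for
    categories -- lexical category of each word, left to right
    position   -- index of the neighbouring word the edge would consume first
    """
    if position < 0 or position >= len(categories):
        return False
    table = FIRST if direction == "F" else LAST
    return categories[position] in table.get(needed, set())


# "again hit children": ADV, V-n, N
cats = ["ADV", "V-n", "N"]
forward_ok = can_build_active_edge("F", "NP", cats, 2)     # V-bar//NP from "hit"
backward_ok = can_build_active_edge("B", "XPDE", cats, 1)  # NP\\XPDE from "children"
print(forward_ok, backward_ok)
```

Under these assumed tables the forward edge of Figure 10a passes the test and the backward edge of Figure 10b is rejected before it is ever placed on the chart.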
7. The Heuristic Scheduling Policy 
Each step in the parsing process can very often produce more than one subsequent
step. For example, a new edge built into a chart may cause an arbitrary number of
edges (candidate constituents) to be built. Usually, in such situations some of them
Lin-Shan Lee et al. Processing System for Chinese Language 
should be processed before the others, rather than processing all of them exhaustively.
In other words, a well-defined scheduling policy is helpful. This is why most chart
parsers use an agenda to schedule these steps (Kay 1980). The present system also uses
a heuristic scheduling policy, described in this section.
In the present system the scheduling policy is primarily based on heuristic estimates
obtained from empirical experience: each candidate constituent is assigned a priority
that indicates its processing order. Most of the time, the priority is determined by
the constituent's category. For example, a constituent with an S category will be
constructed prior to a constituent with a VP category (so that some unnecessary VP
constituents may be eliminated), a constituent with a VP category will be constructed
prior to one with a V-bar category, and so on. If more than one candidate constituent
has the same priority, the constituent at the right-most position (located at the
farthest end vertex) is the first to be built.
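A priority agenda of this kind might be sketched as follows. The category ordering S, VP, V-bar, NP follows the text; the numeric ranks, the class name Agenda, and the tuple layout of a candidate are assumptions made only for illustration.

```python
import heapq

# Smaller rank = built earlier. The ordering follows the text; the numbers are assumed.
RANK = {"S": 0, "VP": 1, "V-bar": 2, "NP": 3}


class Agenda:
    """Hypothetical agenda ordering candidates by category, then right-most end vertex."""

    def __init__(self):
        self._heap = []
        self._count = 0  # insertion counter, keeps the ordering deterministic

    def push(self, category, start, end):
        # Negating the end vertex makes the right-most candidate pop first on ties.
        key = (RANK.get(category, len(RANK)), -end, self._count)
        heapq.heappush(self._heap, (key, (category, start, end)))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)


agenda = Agenda()
agenda.push("NP", 2, 3)
agenda.push("V-bar", 1, 2)
agenda.push("NP", 0, 1)
agenda.push("S", 0, 3)
order = [agenda.pop() for _ in range(len(agenda))]
print(order)
```

The S candidate is popped first, then the V-bar candidate, and of the two NP candidates the one ending at the farther vertex comes before the other.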
To see how the above heuristic scheduling policy is integrated with the head-driven
chart parser to parse an input sentence efficiently, a simple example is used below to
show the parsing process. Suppose the input sentence is:
you of brother again hit children
(your brother hits children again)
and the grammar rules and the FIRST and LAST tables used are those shown in Figures 7-9.
The resulting chart is shown in Figure 11a, where the numbers attached to the edges
indicate the order in which they were built. It is easy to see that many constructions
have been avoided in the chart. To keep the illustration simple and clear, we follow
the parser only through the sentence fragment "again hit children" (hit children
again), as shown in Figure 11b.
Before the parsing process starts, three inactive edges are constructed in the chart
to represent the sentence fragment. Then, based on the head-driven principle and the
sample grammar, the word "hit," according to rule (4) in the sample grammar, is an
initial head (a transitive verb) that needs a right neighboring NP to form a V-bar
(this is represented by an F-active edge; i.e., edge (1) in Figure 11b); the word
"children," based on rule (1), is a final head (a noun) that either can be an NP by
itself (this is represented by an inactive edge; i.e., edge (2) in Figure 11b) or needs
a left neighboring XPDE to form an NP (this is represented by a B-active edge; i.e.,
edge (*) in Figure 11b). When each of these three edges is checked against the FIRST or
LAST table, as illustrated in Figure 10b, edge (*) should not be built (it is
redundant), but edges (1) and (2) are both potential candidates and should be built.
According to the heuristic scheduling policy, a V-bar edge is built prior to an NP
edge; therefore, edge (1) is the first edge to be built. However, since no NP is yet in
the chart, no new edges can be produced after edge (1) is added, and edge (2), the only
remaining candidate, is then added to the chart. This edge satisfies the request of
edge (1) and therefore creates a V-bar inactive edge (edge (3)) as a new candidate.
Meanwhile, since edge (2) is not a head, it triggers no active edges, so edge (3) is
the third edge to be built. Similarly, a VP edge (edge (4)) is then triggered by edge
(3), and, finally, a complete VP constituent (edge (5)) is built.
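The walk-through above can be reproduced with a toy head-driven chart parser. The sketch below is a simplified illustration, not the authors' implementation: the four rules, the table excerpts, the numeric ranks, and all function names are assumptions chosen so that the fragment "again hit children" can be traced, and edges are plain tuples (category, start, end, direction, needed).

```python
import heapq

CATS = ["ADV", "V-n", "N"]  # again, hit, children
# (lhs, head, direction, needed); direction None marks a rule with the head alone.
RULES = [
    ("NP", "N", None, None),
    ("NP", "N", "B", "XPDE"),
    ("V-bar", "V-n", "F", "NP"),
    ("VP", "V-bar", "B", "ADV"),
]
FIRST = {"NP": {"N"}}                      # assumed table excerpts
LAST = {"XPDE": {"DE"}, "ADV": {"ADV"}}
RANK = {"S": 0, "VP": 1, "V-bar": 2, "NP": 3}


def parse(cats):
    chart = [(c, i, i + 1, None, None) for i, c in enumerate(cats)]  # lexical edges
    agenda, built, counter = [], [], 0

    def schedule(edge):
        nonlocal counter
        heapq.heappush(agenda, ((RANK[edge[0]], -edge[2], counter), edge))
        counter += 1

    def trigger(cat, start, end):
        """Propose an edge for every rule whose head is `cat`, after look-ahead."""
        for lhs, head, direction, needed in RULES:
            if head != cat:
                continue
            if direction is None:
                schedule((lhs, start, end, None, None))
            elif direction == "F" and end < len(cats) and cats[end] in FIRST.get(needed, ()):
                schedule((lhs, start, end, "F", needed))
            elif direction == "B" and start > 0 and cats[start - 1] in LAST.get(needed, ()):
                schedule((lhs, start, end, "B", needed))

    def combine(active, inactive):
        """Return the completed edge if `inactive` satisfies `active`, else None."""
        lhs, a_start, a_end, direction, needed = active
        cat, i_start, i_end, _, _ = inactive
        if cat != needed:
            return None
        if direction == "F" and i_start == a_end:
            return (lhs, a_start, i_end, None, None)
        if direction == "B" and i_end == a_start:
            return (lhs, i_start, a_end, None, None)
        return None

    for cat, start, end, _, _ in list(chart):
        trigger(cat, start, end)
    while agenda:
        _, edge = heapq.heappop(agenda)
        chart.append(edge)
        built.append(edge)
        is_active = edge[3] is not None
        for other in chart[:-1]:
            other_active = other[3] is not None
            if is_active and not other_active:
                done = combine(edge, other)
            elif not is_active and other_active:
                done = combine(other, edge)
            else:
                done = None
            if done:
                schedule(done)
        if not is_active:
            trigger(edge[0], edge[1], edge[2])
    return built


edges = parse(CATS)
for number, edge in enumerate(edges, start=1):
    print(number, edge)
```

With these assumed rules the edges are built in the order described in the text: the V-bar active edge, the NP, the V-bar, the VP active edge, and finally the complete VP, while the backward NP edge looking for an XPDE is never scheduled.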
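[Figure 11 graphic: (a) the chart for the whole input sentence, with edges numbered in
order of construction, ending with edge (14), an S; (b) the chart for the fragment
"again hit children" (categories ADV, V-n, N), including edges (1) V-bar//NP and
(4) VP\\ADV and the unbuilt edge (*) NP\\XPDE.]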
Figure 11 
An example to demonstrate the parsing process. 
8. The Raise-Bind Mechanism and Check Rules 
The raise-bind mechanism presented here is specially developed in the present system 
to treat the difficult problems of movement transformations and empty categories so 
that the design of the grammar can be simplified. It is used to cope with the empty 
categories; in other words, to find the antecedent for each empty category except for 
those that are free (such as in sentence (8) in Section 3). During the parsing process 
when an NP is desired, the parser, with the aid of the raise-bind mechanism, will 
perform the following operations. 
First, a corresponding active edge indicating the request for an NP may be built in 
the chart after looking up the FIRST or LAST tables. This request can be satisfied when 
a desired NP is actually encountered. Second, if the desired NP is not encountered 
and the NP position can, instead, be filled by an empty category (according to the 
check rules with details explained below), then an empty NP will be generated to fill 
the vacant position and a new edge (active or inactive) denoting this satisfaction will 
be built in the chart. This empty NP will then be raised up in some way along the 
parsing tree (implicitly represented in the chart) when the tree is growing up (recall 
that the parser works bottom up), until its antecedent is parsed. At this point, the
parser binds the empty NP by setting it to refer to its antecedent (this is also guided 
by check rules as described below). Once bound, the empty NP will not be raised up 
any further, because an empty NP has exactly one antecedent and cannot be bound 
more than once. 
Not every NP position can be filled by an empty category. In the Chinese language, 
empty categories only appear in the subject position and direct object position, never 
in the indirect object position or the prepositional object position. In our
implementation, an empty NP contains three fields: (1) a pointer to its antecedent,
(2) a field recording where it came from, and (3) a field holding the syntactic or
semantic constraints on the empty NP for later checking. Rules for this kind
of checking are called check rules in the present system. Most of the time, these check 
rules are invoked when a constituent containing unbound empty categories is built in 
the chart. Usually, distinct rules are used to treat different problems. For example, we 
can informally state the rules to treat the relativization phenomena as follows: for a 
noun and a relative clause to be combined into an NP, the relative clause must contain 
an empty NP that is unbound and marked to be coming from either the subject posi- 
tion or the object position of the relative clause, and then this empty NP will be bound 
to the (head) noun (just as in sentence (5) in Section 3; a further example will be given 
below). We can also state the check rules for passivization as follows: once a clause is
constructed, the parser checks whether a prepositional phrase formed by the passive
marker and an NP (similar to "by + NP" in English) is involved in the clause. If so,
there must be an empty NP
that is unbound and marked to be coming from some object position, and this empty 
NP will be bound to the subject of the clause (just as in sentence (3) in Section 3; a 
further example will be given below). The check rules for pivot construction can also 
be formulated as follows: in a pivot construction, the direct object will bind the empty 
NP coming from the subject position of the embedded clauses (just as in sentence (7) 
in Section 3; a further example will be given below). Check rules for other
linguistic phenomena such as topicalization, ba-transformation, and so on can be
developed similarly. The binding process in the raise-bind mechanism is rule-based
rather than principle-based; that is, it is determined entirely by the check rules and
the phrase structure rules, whereas in principle-based parsers, for example a parser
based completely on GB theory (Wehrli 1988), it is governed by linguistic principles
such as the government-binding principle of GB theory. Rule-based approaches may cost
somewhat more computation than principle-based approaches, but they seem more flexible
in dealing with some specific problems. This is why in sentence (7) (pivot
construction) the empty category in the subject position of the embedded clause can be
bound to the NP "children" in the higher clause, even though it is not governed by
that NP.
To illustrate the operation of the above check rules, consider phrase (9) below and its
parsing tree in Figure 12, in which several such phenomena interact with one another.
With the present approach, this complicated case can be handled easily.
9. by Li-s ask go to dinner (relativizer) children
(the children who were asked by Li-s to go to dinner)
Let us follow the bottom-up parser as it parses phrase (9): (1) Node S1 (a clause
constituent) is constructed and e1 serves as the dummy subject (an NP). (2) Node V-bar
is constructed and the dummy object e2 is inserted. Because the empty category e1
exists in the embedded clause S1, the check rules are invoked. According to the check
rules for pivot construction, e1 is bound to e2. (3) Node S2 is constructed with an
empty NP e3. S2 is a passive clause because of the PP "by Li-s." According to the check
rules for passivization, e3 binds e2. (4) Node NP is constructed. According to the
check rules for relativization, e3 is bound to "children." Notice that only e3 was
raised up across the node S2, because e1 and e2 had been bound before S2 was
constructed.
[Figure 12 graphic: parse tree over "by Li-s," "ask," "go to dinner," the relativizer,
and "children," with the bindings e1 → e2, e2 → e3, and e3 → "children."]
Figure 12 
The parsing tree of the example (9). 
Once the parsing tree in Figure 12 is completed, it is easy to answer questions such
as who was asked and who went to dinner. Since e1 is the dummy subject of "go to
dinner" and the binder of e1 is e2, whose binder is e3, whose binder is "children,"
we can conclude that it is the children who went to dinner. In the same way, since e2
is the dummy object of "ask," we can also conclude that it is the children who were
asked.
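The three-field empty NP and the chain of binders in phrase (9) can be mimicked with a small record type. This is a hedged sketch only: the class name EmptyNP, the helper resolve, and the wording of the origin strings are assumptions, and the check rules themselves are not modelled; the bindings are simply applied in the order given in the text.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class EmptyNP:
    """Hypothetical empty-NP record with the three fields described in the text."""
    name: str
    origin: str                                          # where it came from
    antecedent: Optional[Union["EmptyNP", str]] = None   # pointer to its binder
    constraints: list = field(default_factory=list)      # constraints for later checking

    def bind(self, binder):
        # An empty NP has exactly one antecedent and cannot be bound twice.
        if self.antecedent is not None:
            raise ValueError(f"{self.name} is already bound")
        self.antecedent = binder


def resolve(np):
    """Follow antecedent pointers until an overt NP is reached (None if free)."""
    while isinstance(np, EmptyNP):
        if np.antecedent is None:
            return None
        np = np.antecedent
    return np


e1 = EmptyNP("e1", "subject of the embedded clause S1")
e2 = EmptyNP("e2", "object of 'ask'")
e3 = EmptyNP("e3", "subject of the passive clause S2")
e1.bind(e2)          # pivot construction
e2.bind(e3)          # passivization
e3.bind("children")  # relativization

print(resolve(e1), resolve(e2))
```

Following the pointers from e1 or from e2 ends at "children," which is how the two questions about phrase (9) are answered.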
The raise-bind mechanism also serves as a filter to rule out incorrect sentences or
incorrect parsing trees. For example, if no empty NP is raised up or no empty NP is
bound within a construction involving passivization or relativization, the construction
will be rejected by the check rules. On the other hand, an unbound empty NP may
legitimately have no antecedent. The check rules decide this by looking up the
attributes of the corresponding verb. If the corresponding argument of the verb can
actually be omitted, the parsing tree is accepted; otherwise, it is ruled out. If this
mechanism were adopted for English sentence analysis, a test would have to be performed
to rule out sentences containing empty categories that have no binder. Such sentences
are, however, generally grammatical in Chinese (as in sentence (8) in Section 3).
9. Further Discussion of the Raise-Bind Mechanism 
Relativization in Chinese is a long-distance movement; that is, it can sometimes move
an object across several S (sentence) nodes. The noun phrase in (10) below is an
example. The noun phrase in (11), on the other hand, is ambiguous. If the head noun
("the man") binds e1, this NP means "the man whom someone likes." If the head noun
binds e2, it means "the man who likes someone or something." Semantic information is
needed to remove the ambiguity.
10. I ask Li-s help me buy e (relativizer) book
(the book which I asked Li-s to help me buy)
11. [S e2 like e1] (relativizer) the man
Considering the above situations, we can further improve the check rules as fol- 
lows: for a noun and a relative clause to be combined into an NP, the parser checks 
the "empty-NP list" raised up from the relative clause, and 
• if no empty NP is raised up, rule out the NP constituent; 
• if an empty NP is raised up and marked to be coming from the subject 
position or object position or embedded object position (as in example 
(10)), set the empty NP to be bound to the head noun; 
• if two empty NPs are raised up from both the subject and object
positions (as in example (11)), employ semantic analysis to determine the
proper binding (the present system is syntactically based; such
semantic analysis is left to the next phase of this research).
Like relativization, topicalization is also a long-distance movement and can be further 
improved in a similar way. 
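The three cases of the refined relativization rule can be written as a single decision function. The sketch below is an assumption-laden illustration: empty NPs are plain dictionaries, the function name and the status strings are invented here, and the semantic analysis required in the ambiguous case is deliberately left out. It also rejects an NP whose raised empty NPs all come from positions other than the three listed, which goes slightly beyond the wording of the first bullet.

```python
ALLOWED_ORIGINS = {"subject", "object", "embedded object"}


def check_relativization(empty_np_list):
    """Classify the empty-NP list raised from a relative clause.

    Returns (status, names): "reject" if nothing can be bound to the head noun,
    "bound" if exactly one empty NP qualifies, and "ambiguous" if several do,
    in which case a semantic component would have to choose among them.
    """
    candidates = [
        e["name"]
        for e in empty_np_list
        if e["antecedent"] is None and e["origin"] in ALLOWED_ORIGINS
    ]
    if not candidates:
        return ("reject", [])
    if len(candidates) == 1:
        return ("bound", candidates)
    return ("ambiguous", candidates)


# Example (10): one empty NP raised from an embedded object position.
ex10 = [{"name": "e", "origin": "embedded object", "antecedent": None}]
# Example (11): empty NPs raised from both the subject and the object position.
ex11 = [
    {"name": "e2", "origin": "subject", "antecedent": None},
    {"name": "e1", "origin": "object", "antecedent": None},
]
print(check_relativization(ex10))
print(check_relativization(ex11))
print(check_relativization([]))
```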
Another syntactic phenomenon crucial to the parser is known as the Complex NP
Constraint (CNPC) (Radford 1981); i.e., no transformation rule can move any element
out of a complex NP, where a complex NP (CNP) is an NP containing a relative clause.
This CNPC can be easily encoded in the grammar in the present approach by a simple 
rule; i.e., no empty NPs can be raised up across a CNP node. Hence, it is impossible 
for the empty NP within a CNP to be bound to any element out of that CNP. 
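As a rough illustration of how simple this rule is, the raising step might filter the empty-NP list as follows. The function name and the dictionary representation of an empty NP are assumptions for this sketch only.

```python
def raise_empty_nps(child_empty_nps, parent_is_complex_np):
    """Return the empty NPs that may be raised past the parent node.

    Bound empty NPs are never raised further, and nothing is raised across a
    complex NP (CNP) node, which encodes the CNPC.
    """
    if parent_is_complex_np:
        return []
    return [e for e in child_empty_nps if e["antecedent"] is None]


inside = [
    {"name": "e1", "antecedent": None},
    {"name": "e2", "antecedent": "children"},
]
past_clause = raise_empty_nps(inside, parent_is_complex_np=False)
past_cnp = raise_empty_nps(inside, parent_is_complex_np=True)
print([e["name"] for e in past_clause], past_cnp)
```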
In most cases, ba-transformation and passivization move the direct objects of verbs. 
But the phenomenon known as "subject-to-object raising" (Radford 1981) has some dif- 
ferences. In such a case, the subject of an embedded clause can be moved into the sub- 
ject (or ba-object) position of the higher clause by passivization (or ba-transformation). 
For example, sentence (13) is derived from sentence (12) by such a movement. 
12. people future will believe this mistake is correct
(People will believe in the future that this mistake is correct.)
13. this mistake future will by people believe is correct
(This mistake will be believed to be correct by the people in the future.)
To cope with such subject-to-object raising, the rules described in the previous section 
for passivization can be modified as follows. The subject of a passive clause will bind 
the empty NP in either the object position or the subject position of an embedded 
clause. 
The raise-bind mechanism is a computational approach to deal with the binding 
of empty categories. Its most attractive feature is, in fact, that it is specially designed 
for use with a head-driven strategy, as in the present system. In ATN (Bates 1978),
the hold-list mechanism is used for a similar purpose. However, it is not very helpful
in parsing Chinese because: (a) it does not really fit the head-driven operation; (b) it
cannot deal with truly unbound empty categories (e.g., example (8)); (c) it handles left
extraposition (e.g., sentences (2)-(4)), but not right extraposition (e.g., sentence
(5)). A movement is called left (or right) extraposition if it moves an NP to a position
left (or right) of its trace. To deal with right extraposition, ATN uses another
mechanism.
In GB theory, both left extraposition and right extraposition move an NP to a 
position governing its trace; a null pronominal, if bound, is always bound to an NP 
governing the null pronominal (Chomsky 1981). So, the raise-bind mechanism com- 
bined with the check rules discussed here is sufficient to cope with all empty categories, 
left or right extrapositions, and traces or null pronominals, since its function is simply 
to raise up an empty category to be bound by an NP that governs this empty category.
We believe that by means of the raise-bind mechanism it will not be difficult to
implement similar linguistic devices such as the slash concept of GPSG (Sells 1985).
However, the present approach still has some limitations; for example, the check rules
incur some computational cost, and multiple binding may occur if some of the check
rules are inconsistent with one another.
10. Preliminary Experimental Results 
To see how the present approaches compare with conventional approaches in parsing
Chinese sentences, an experimental system was implemented and extensive experiments
were performed. The system is written in C and runs on an IBM PC/AT. A total of 47
phrase structure rules were used, of which 19 are backward rules with a final head
(indicated by a backward flag) and 28 are forward rules with an initial head (indicated
by a forward flag). These rules and the corresponding FIRST and LAST tables are listed
in APPENDIX A and APPENDIX B, respectively. Four tests were performed for each test
sentence. In Test I, a conventional left-to-right parsing strategy without any
look-ahead capability was used, while in Test II the left-to-right parsing strategy was
equipped with a forward look-ahead approach (with the FIRST table only). In Test III,
the present head-driven parsing strategy based on the direction-selective chart was
used without any look-ahead capability, and finally in Test IV, the present head-driven
parsing strategy based on the direction-selective chart was equipped with the
bidirectional look-ahead approach (with both the FIRST and LAST tables). Because of the
flexibility of the present direction-selective chart, all four tests can be implemented
easily in the present system. For example, to perform Test I one simply switches all
the flags in the phrase structure rules to the forward mode.
After every test sentence was parsed in each test and an output syntax tree was
obtained, the total number of constituents constructed in the process of parsing was
recorded. Figure 13 lists the total number of constituents constructed in each of the
four tests, together with the number of resulting output parsing trees, for 25 typical
sentence examples selected from a total of 200 test sentences. These 25 typical
sentence examples are also listed in APPENDIX C. Figure 14 shows the reduction ratios
of edge construction for the four tests as compared to Test I (i.e., the ratio of the
number of edge constructions to that of Test I) for the 25 sentence examples, together
with the average reduction ratios for all the 200 test sentences. The last row of
Figure 14 lists the average time needed to process a sentence on the IBM PC/AT in each
test. It can be found that, as compared to the
conventional approach in Test I, on average the total number of constituents
constructed for a sentence is reduced to 0.635 of its original value (i.e., to less
than 2/3), and the required processing time is reduced from 8.2 sec to 3.1 sec per
sentence, through the use of the present direction-selective chart, the head-driven
parsing strategy, and the bidirectional look-ahead approach (Test IV). If the
direction-selective chart and the head-driven parsing strategy are used alone without
look-ahead capability (Test III), an edge reduction ratio of 0.762 is achieved and it
takes about 6.4 sec on average to parse a sentence, which is close to the edge
reduction ratio and processing speed (0.758 and 6.2 sec/sentence) obtained with the
FIRST table only and conventional left-to-right parsing (Test II). Another important
observation is that the edge reduction is more prominent, that is, the present
approaches become more efficient, when the test sentence has a higher degree of
ambiguity (a larger number of parsing trees is obtained). Test sentences 20 and 22 in
Figures 13 and 14 are good examples. In any case, these results show that the present
approaches of direction-selective chart, head-driven parsing, and bidirectional
look-ahead eliminate many unnecessary searching actions (or active edges) and make the
parsing process much more efficient than conventional left-to-right parsing strategies
in parsing Chinese sentences.
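The averages quoted above can be turned into edge savings and speed-up factors with a few lines of arithmetic. The sketch below uses only the average ratios and average times reported for the four tests; the dictionary and variable names are illustrative.

```python
# Average edge reduction ratios and processing times (sec/sentence) reported for Tests I-IV.
avg_ratio = {"I": 1.000, "II": 0.758, "III": 0.762, "IV": 0.635}
avg_seconds = {"I": 8.2, "II": 6.2, "III": 6.4, "IV": 3.1}

edges_saved = {test: 1.0 - ratio for test, ratio in avg_ratio.items()}
speedup = {test: avg_seconds["I"] / secs for test, secs in avg_seconds.items()}

for test in avg_ratio:
    print(f"Test {test}: {edges_saved[test]:.1%} fewer edges, "
          f"{speedup[test]:.2f} times the speed of Test I")
```

On these figures Test IV builds roughly 36.5% fewer edges than Test I, while its processing time falls by a larger proportion (8.2 sec to 3.1 sec).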
On the other hand, in order to show the capabilities of the present parser especially 
with the raise-bind mechanism to handle the difficult problem of empty categories, 
parsing results of several typical sentence examples having empty categories are in- 
cluded in APPENDIX D in bracketed text form. These results show that the raise-bind 
mechanism, combined with the check rules proposed here, can certainly treat sophis- 
ticated syntactic problems of empty categories and movement transformations and 
simplify the design of grammar rules. 
Although the lexicon implemented on the present system is relatively small (in- 
cluding 1,120 words) compared with dictionaries for practical applications (it is es- 
timated that at least 80,000 words are necessary), the capabilities of this system are 
clearly demonstrated. It was estimated in the tests that, as a general-purpose system 
without any special problem domain, about 80% to 85% of the sentences in the high- 
school textbooks of Taiwan, can be successfully analyzed by this system, provided 
that all the necessary words are either already in or can be keyed into the lexicon 
before analysis or parsing is performed. This estimate comes from the tests themselves:
a total of 241 sentences randomly selected from these textbooks were tested, and
correct parsing trees were obtained for 200 of them (about 83%). These 200 sentences
are the ones used in all the above discussions.
11. Concluding Remarks and Future Research Directions 
In this paper an efficient natural language processing system specially designed for the 
Chinese language is presented. The present design is the result of careful considera- 
tion of some of the special syntactic phenomena of the Chinese language; for example, 
head-final and head-initial structures, empty categories, and movement transformations.
The present system integrates several novel approaches: the head-driven parsing
strategy, the direction-selective chart, the bidirectional look-ahead approach, the
heuristic scheduling policy, and the raise-bind mechanism based on check rules. The
head-driven parsing strategy can eliminate unnecessary searching actions, and the
direction-selective chart simplifies the control of the
head-driven parsing strategy and makes the parser more flexible in performing many 
different parsing strategies. The heuristic scheduling policy and the bidirectional look- 
ahead approach can, in fact, further significantly reduce the large number of searching 
[Figure 13 table body not recoverable from the source. Columns: test sentence; number
of edge constructions in Test I, Test II, Test III, and Test IV; number of parsing
trees.]
Figure 13
A table listing the total number of constituents constructed in the four tests and the number of
parsing trees obtained for the 25 typical sentence examples listed in APPENDIX C.
actions and make the parsing processes more efficient, while the raise-bind mecha- 
nism and the check rules can certainly handle sophisticated problems of movement 
transformations and empty categories, and can simplify the design of grammar rules. 
Although much work is still in progress to improve the present system further, it is
a promising initial attempt to process efficiently natural sentences of the Chinese
language, the structure of which is significantly different from that of most western
languages, such as English.
Although the present system has shown satisfactory initial results, some difficulties
inherent in the Chinese language remain, and significant improvement over the present
system is still highly desirable. One of the primary difficulties is the lack of
inflection in Chinese words. This produces multiple solutions in word category
identification and causes exponential growth in the number of structures. It is
therefore believed that an integrated syntactic and semantic analysis will eventually
become necessary. Because verbs are, ultimately, heads of sentences,
appropriate classification of verbs may help in determining syntactic structure and, 
thus in grasping the semantic meaning of sentences. Although in most linguistic theo- 
ries verbs are classified according to syntactic properties, and some linguistic theories, 
such as Lexical Functional Grammar (Sells 1985) and Case Grammar (Fillmore 1968), 
also provide mechanisms to explicitly represent functional or semantic role assignment 
of constituents, such work for the Chinese language is still relatively preliminary. Chao 
(1968) has only distinguished intransitive verbs from transitive verbs and Yang (1987) 
has made very encouraging initial efforts by classifying Chinese verbs according to 
[Sentence numbers in the first column are only partly legible in the source; rows are
given in source order.]
                                   Edge Reduction Ratios
                                   Test I   Test II   Test III   Test IV
                                   1.000    0.725     0.809      0.675
                                   1.000    0.757     0.718      0.598
                                   1.000    0.794     0.815      0.679
                                   1.000    0.764     0.711      0.583
                                   1.000    0.736     0.808      0.667
                                   1.000    0.806     0.788      0.717
                                   1.000    0.760     0.742      0.630
                                   1.000    0.80?     0.811      0.675
                                   1.000    0.800     0.808      0.718
                                   1.000    0.757     0.875      0.785
                                   1.000    0.710     0.647      0.586
                                   1.000    0.719     0.704      0.632
                                   1.000    0.664     0.797      0.719
                                   1.000    [illegible] [illegible] [illegible]
                                   1.000    0.701     0.638      0.569
                                   1.000    0.748     0.844      0.714
                                   1.000    0.791     0.808      0.729
                                   1.000    0.723     0.672      0.617
                                   1.000    0.729     0.750      0.656
                                   1.000    0.698     0.592      0.514
                                   1.000    0.834     0.904      0.772
                                   1.000    0.817     0.675      0.613
                                   1.000    0.766     0.717      0.646
Average ratios                     1.000    0.758     0.762      0.635
Average speed of processing
(sec/sentence)                     8.2      6.2       6.4        3.1
Figure 14
A table showing the edge reduction ratios for the 25 typical sentence examples listed in
APPENDIX C and the average reduction ratios for all the 200 test sentences for the four tests
compared to Test I.
their transitivity into eight different syntactic classes and giving each verb a semantic 
category that can be used to decide the case frame to solve the problem of serial verb 
construction. Recently, a new classification scheme for Chinese verbs has been devel- 
oped (K.-J. Chen et al. 1988), in which current theories of feature-based categorization 
are adopted. This scheme is based on the results of analyzing 16,824 Chinese verbs, 
with careful consideration given to difficulties in parsing Chinese sentences. In the 
next stage of the present study, this verb classification will be employed, and much 
more syntactic and semantic information will be provided by the lexicon, especially 
for verbs, and represented as complex feature structures (Gazdar 1988). Furthermore, 
several other approaches will also be included in the next stage of the present study,
such as the unification concept (Shieber 1986) and the slot and filler principle
(Hellwig 1988). In other words, although there is still a very long way to go before a
really convenient and efficient natural language processing system for Chinese becomes
available, the present system serves as a successful initial step on the way.
Appendix A 
The phrase structure rules used in the experiments 
1. b S2 → S PRTAG
2. b S → NP VP
3. f S → VP
4. f NP → N
5. f NP → PLACE
6. f NP → TIME
7. b NP → NP LOC
8. f NP → LOC
9. b NP → PreN1 N
10. b NP → PreN2 N
11. b NP → PreN3 N
12. f PreN2 → QP
13. b PreN2 → XPDE QP
14. b PreN1 → XPDE ADJ
15. b PreN1 → QP ADJ
16. b PreN1 → ADJ
17. b PreN1 → XPDE QP ADJ
18. b PreN1 → QP XPDE ADJ
19. b PreN3 → QP XPDE
20. f PreN3 → XPDE
21. b XPDE → S DE
22. b XPDE → NP DE
23. b XPDE → PP DE
24. f VP → AUX VBAR
25. f VP → AUX ADV VBAR
26. f VP → ADV AUX VBAR
27. f VP → ADV AUX PP VBAR
28. f VP → AUX ADV PP VBAR
29. f VP → ADV VBAR
30. f VP → PP VBAR
31. f VP → VBAR
32. f VBAR → V
33. f VBAR → V QP
34. f VBAR → V PP
35. f VBAR → V VP
36. f VBAR → V S
37. f VBAR → V NP NP
38. f VBAR → V NP PP
39. f VBAR → V NP VP
40. f VBAR → VNP
41. f VBAR → V NP
42. f PP → PREP NP
43. b QP → DET NO CLMS
44. b QP → NO CLMS
45. b QP → DET CLMS
46. f CLMS → CL
47. f CLMS → MS
APPENDIX B. The FIRST and LAST tables for the sample grammar shown in 
APPENDIX A 
(a) The FIRST table for the sample grammar shown in APPENDIX A
[Table body not recoverable from the source. The columns are the lexical categories NO,
MS, CL, DET, PREP, V, ADV, AUX, DE, ADJ, N, PRTAG, PLACE, TIME, and LOC; the rows are
the lexical and phrasal categories of the grammar.]
(b) The LAST table for the sample grammar shown in APPENDIX A
[Table body not recoverable from the source. The columns are the same lexical
categories as in (a); the rows are the lexical and phrasal categories of the grammar.]
APPENDIX C. 25 example sentences to show typical experimental results 
1. my (child) has go to school (aspect)
(My child has just gone to school.)
2. your brother again hit my child
(Your brother hit my child again.)
3. this is a (classifier) can listen Mandarin (relativizer) computer
(This is a computer which can listen to Mandarin.)
4. he give I a (classifier) I very like (relativizer) flower
(He gave me a flower which I liked very much.)
5. I like my painting hang on wall (localizer)
(I like my painting to be hung on the wall.)
6. a (classifier) seed sleep in earth (localizer)
(A seed is planted in the earth.)
7. I don't like do business (relativizer) that (classifier) man
(I don't like that fellow who is a businessman.)
8. he is my high school classmate
(He was my classmate when I was in high school.)
9. at Post Office behind there is a row little houses
(There is a row of houses behind the Post Office.)
10. we ask teacher assign job
(We ask the teacher to assign jobs for us.)
11. I live in that (classifier) building (localizer)
(I live in that building.)
12. child laugh he is fat
(The child laughed at his fatness.)
13. I find my a (classifier) pen lose in classroom
(I soon discovered that I had left my pen in the classroom.)
14. in this (classifier) tree behind we find (aspect) a (classifier) white flower
(We found a white flower behind this tree.)
15. I believe Uang-wu would think Jang-san very like that (classifier) pretty lady
(I believed that Uang-wu would think Jang-san liked that pretty lady very much.)
16. people must think this is a (classifier) wrong thing
(People would think this is a wrong thing.)
17. I (ba) your a (classifier) letter put on desk (localizer)
(I left a letter of yours on the desk.)
18. you tomorrow class about to what time
(What time will your class end tomorrow?)
19. he want (ba) that (classifier) house sell to I
(He wants to sell that house to me.)
20. he say he very like (ba) that (classifier) green car park on door that (classifier) big tree behind
(He said he liked very much to park that green car beside the big tree near the front door.)
21. little bird all day long sit on this (classifier) tree (localizer) wait for mother
(A little bird sat on the tree all day long waiting for its mother.)
22. he say he very like (ba) that (classifier) green car park on door of that (classifier) big tree (localizer) of a (classifier) small alley (localizer)
(He says he likes very much to park that green car in a small alley which is beside the big tree near the front door.)
23. we yesterday have (aspect) a (classifier) happy holiday
(We had a very happy holiday yesterday.)
24. he (ba) a (classifier) water pour on that (classifier) child body (localizer)
(He poured a bucket of water over that child.)
25. they say this child is that (classifier) school of good student
(They say this child is a good student of that school.)
APPENDIX D. Sample Parsing Results 
1. INPUT ===>
I last year lost (relativizer) that (classifier) dog I before assume already die (aspect marker) yesterday unexpectedly by me found (aspect marker)
(I used to assume the dog I lost last year must have already died, but it was unexpectedly found by me yesterday.)
(SENTENCE
  (NP <NOUN>
    (XPDE <S>
      (S
        (NP <NOUN>
          (N <PRON> ===> I ))
        (VP
          (ADV ===> last year )
          (V-bar
            (V <V-n> ===> lost )
            (EMPTY (No.0)))))
      (DEn ===> relativizer ))
    (QP <Mc>
      (DET ===> that )
      (CL <Mc> ===> classifier ))
    (N <NOUN> ===> dog ))
  (S
    (S-bar
      (S
        (NP <NOUN>
          (N <PRON> ===> I ))
        (VP
          (ADV ===> before )
          (V-bar
            (V <V-s> ===> think )
            (S
              (EMPTY (No.2))
              (VP
                (ADV ===> already )
                (V-bar
                  (V <V-> ===> die )
                  (Aspect ===> aspect marker ))))))))
    (EMPTY (No.5))
    (VP
      (NP <TIME>
        (N <Time> ===> yesterday ))
      (ADV ===> unexpectedly )
      (PP <bei>
        (PREP <bei> ===> by )
        (NP <NOUN>
          (N <PRON> ===> me )))
      (V-bar
        (V <V-n> ===> found )
        (EMPTY (No.4))))
    (PRTAG ===> aspect marker )))
EMPTY.4 == EMPTY.5
EMPTY.0 == N: ( dog )
EMPTY.2 == NP: ( the dog I lost last year )
EMPTY.5 == NP: ( the dog I lost last year )
2. INPUT ===>
this mistake future will by people believe is correct
(This mistake will be believed to be correct by the people in the future.)
(SENTENCE
  (S
    (NP <NOUN>
      (QP <Mc>
        (DET ===> this ))
      (N <NOUN> ===> mistake ))
    (VP
      (ADV ===> future )
      (AUX ===> will )
      (PP <bei>
        (PREP <bei> ===> by )
        (NP <NOUN>
          (N <PRON> ===> people )))
      (V-bar
        (V <V-s> ===> believe )
        (S
          (EMPTY (No.1))
          (VP
            (V-bar
              (V <SHI> ===> is )
              (S
                (EMPTY (No.0))
                (VP
                  (V-bar
                    (V <ADJ> ===> correct ))))))))))
EMPTY.0 == EMPTY.1
EMPTY.1 == NP: ( this mistake )
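The binding lists printed under each parse can be read as chains: an empty category is bound either directly to a phrase or to another empty category that is itself bound. The following minimal Python sketch only illustrates how to follow such chains to the phrase they end in. It is not the raise-bind mechanism of the system; the function name `resolve` and the dictionary encoding of the bindings are assumptions made for this illustration.

```python
def resolve(bindings):
    """Follow chains such as EMPTY.4 -> EMPTY.5 -> NP until a phrase is reached.

    `bindings` maps each empty category to what it is bound to: either
    another empty category (a key of the same dict) or a phrase string.
    This encoding is hypothetical, chosen only for the illustration.
    """
    resolved = {}
    for name in bindings:
        seen = set()
        target = name
        while target in bindings:
            if target in seen:
                raise ValueError(f"cyclic binding at {target}")
            seen.add(target)
            target = bindings[target]
        resolved[name] = target
    return resolved


# Binding lists copied from the two sample parses (English glosses only).
sample1 = {
    "EMPTY.4": "EMPTY.5",
    "EMPTY.0": "N: dog",
    "EMPTY.2": "NP: the dog I lost last year",
    "EMPTY.5": "NP: the dog I lost last year",
}
sample2 = {
    "EMPTY.0": "EMPTY.1",
    "EMPTY.1": "NP: this mistake",
}

for label, sample in (("sample 1", sample1), ("sample 2", sample2)):
    for empty, phrase in sorted(resolve(sample).items()):
        print(label, empty, "=>", phrase)
```

In sample 1 the object gap of "found" (EMPTY.4) resolves through the subject gap EMPTY.5 to the topic noun phrase, and in sample 2 the subject of "correct" (EMPTY.0) resolves through EMPTY.1 to "this mistake".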
References 
Aho, A. V., and Ullman, J. D. (1972). The 
Theory of Parsing, Translation, and 
Compiling, Vol. 1. Englewood Cliffs, NJ:
Prentice-Hall. 
Bates, M. (1978). "The theory and practice 
of augmented transition network 
grammars." Natural Language 
Communication with Computers, 191-259. 
Chao, Y. R. (1968). A Grammar of Spoken 
Chinese. Berkeley, CA: University of 
California Press. 
Chen, H. H., Lin, I. P., and Wu, C. P. (1988). 
"A logical approach to movement 
transformation in Mandarin Chinese." 
International Journal of Pattern Recognition 
and Artificial Intelligence 2(1).
Chen, K. J., and Chang, L. L. (1988). "A 
classification of Chinese verbs for 
language parsing." International Conference 
of Computer Processing and Oriental 
Languages. 
Chen, J. J. (1985). "An experimental parsing 
system for Chinese sentences," M.S. 
thesis, National Taiwan University, Taipei. 
Chomsky, N. (1981). Lectures on Government 
and Binding. Dordrecht: Foris. 
Fillmore, C. (1968). "The case for case." In 
Universals in Linguistic Theory, edited by
Bach and Harms. Holt, Rinehart and
Winston.
Gazdar, G., Franz, A., Osborne, K., and 
Evans, R. (1987). "Natural language 
processing in the 1980s." CSLI, Stanford 
University. 
Gazdar, G. (1988). "Categorial structure." 
Computational Linguistics 14:1-19. 
Hellwig, P. (1988). "Chart parsing
according to the slot and filler principle." 
In Proceedings, International Conference on 
Computational Linguistics, 242-244. 
Ho, W. H. (1984). "Automatic recognition of 
Chinese words." M.S. thesis, National 
Taiwan Institute of Technology, Taipei. 
Huang, J. (1982). "Logical relations in 
Chinese and the theory of grammar." 
Doctoral dissertation, Massachusetts 
Institute of Technology, Cambridge, MA. 
Jiang. (1985). "Chinese parsing: An initial 
exploration at LRC." Computer Processing 
of Chinese and Oriental Languages 2(2):
127-138. 
Kay, M. (1980). "Algorithm schemata and 
data structures in syntactic processing." 
Xerox Report CSL-80-12, Palo Alto, CA. 
Li, C. N., and Thompson, S. A. (1981). 
Mandarin Chinese. University of California 
Press. 
Lin, L. J. (1985). "A syntactic analysis 
system for Chinese sentences." M.S. 
thesis, National Taiwan University, Taipei, 
Taiwan.
Lin, L.-J.; Huang, J.; Chen, K.-J.; and Lee, 
L.-S. (1986). "SASC: A syntactic analysis 
system for Chinese sentences." In 
Proceedings, International Conference on 
Chinese Computing. Singapore. 
Lin, L.-J.; Chen, K.-J.; Huang, J.; and Lee, L.-S.
(1986). "A Chinese natural language 
processing system based upon the theory 
of empty categories." In Proceedings, Fifth 
National Conference on Artificial Intelligence 
(AAAI). Philadelphia, PA. 
Marcus, M. P. (1982). A Theory of Syntactic 
Recognition for Natural Language. 
Cambridge, MA: The MIT Press. 
Pareschi, R., and Steedman, M. (1987). "A
lazy way to chart-parse with categorial 
grammars." In Proceedings, 25th Annual 
Meeting of the Association for Computational 
Linguistics. 
Radford, A. (1981). Transformational Syntax: 
A Student's Guide to Chomsky's Extended 
Standard Theory. Cambridge, U.K.: 
Cambridge University Press. 
Sells, P. (1985). "Lectures on contemporary
syntactic theories: An introduction to 
government-binding theory, generalized 
phrase structure grammar, and 
lexical-functional grammar." CSLI. 
Shieber, S. M. (1986). An Introduction to
Unification-Based Approaches to Grammar. 
Chicago: University of Chicago Press. 
Steedman, M. (1985). "Dependency and 
coordination in the grammar of Dutch 
and English." Language 61: 523-568.
Stock, O., Falcone, R., and Insinnamo, P. 
(1988). "Island parsing and bidirectional 
charts." In Proceedings, International 
Conference on Computational Linguistics. 
Tomita, M. (1986). Efficient Parsing for 
Natural Language: A Fast Algorithm for 
Practical Systems. Boston: Kluwer. 
Wehrli, E. (1988). "Parsing with a 
GB-grammar." In Natural Language Parsing 
and Linguistic Theories, edited by U. Reyle 
and C. Rohrer, Dordrecht, Boston: 
D. Reidel, distributed by Kluwer, 177-201. 
Winograd, T. (1983). Language as a Cognitive 
Process. Vol. 1: Syntax. Reading, MA: 
Addison-Wesley. 
Woods, W. (1970). "Transition network
grammars for natural language analysis."
CACM 13(10): 591-606.
Yang, Y. (1987). "Semantic analysis in 
Chinese sentence analysis." In Proceedings, 
International Joint Conference on Artificial 
Intelligence (IJCAI). Milano, Italy.
