On Parsing Binary Dependency Structures Deterministically in 
Linear Time 
Harri ARNOLA¹
Kielikone Oy
P.O. Box 126, 00211 Helsinki, Finland
harri@kielikone.fi
and
Helsinki University of Technology
Computer Science Laboratory 
Espoo, Finland 
Abstract 
In this paper we demonstrate that it is 
possible to parse dependency structures 
deterministically in linear time using 
syntactic heuristic choices. We first prove
theoretically that deterministic, linear 
parsing of dependency structures is possible 
under certain conditions. We then discuss a 
fully implemented parser and argue that 
those conditions hold for at least one natural 
language. Empirical data demonstrates that 
the parsing time is indeed linear. The present 
quality of the parser in terms of finding the 
right dependency structure for sentences is 
about 85%. 
Introduction 
Natural language sentences have ambiguities at 
many levels of abstraction. Since present 
computational algorithms can handle only partial 
structures, one after another, these ambiguities 
cause problems for parsing. A common solution 
is to create alternative structures in parallel, and
explore a forest of possible trees in hope that the 
right parse tree will appear among them. This 
solution for processing ambiguities in parsing 
creates two new problems. Which tree is the 
right one among many in a forest? Furthermore,
in the process of creating alternative structures, 
the number of partial trees tends to grow 
exponentially or at least polynomially with the 
number of words of a sentence. That in turn 
implies similar growth in the processing time. 
If a parsing algorithm were able to make 
confidently only the right local structural 
choices for a sentence, it would deterministically 
produce only a single, correct tree. The benefits 
would be obvious: there would be no search for 
the right tree in a forest, and the processing time 
could be benign. However, to our best 
knowledge, no one has yet been able to produce 
a deterministic parser for a constituent analysis 
of sentences. 
A dependency theory of syntactic structure 
indicates syntactic relations directly between the 
words of a sentence (e.g., Hays, 1964; Hudson,
1976; Hellwig, 1986; Mel'chuk, 1988; Robinson,
1970; Schubert, 1986; Starosta, 1988). We have
studied the parsing of dependency structures
over several years (Nelimarkka et al., 1984;
Jäppinen et al., 1986; Valkonen et al., 1987). In
this paper we discuss the final version of our
fully implemented dependency parser and show
that it is possible to design a heuristic
deterministic dependency parser that parses 
sentences in linear time. The parser chooses 
heuristically only one direct governor among 
alternatives for each word in a sentence. Such a 
deterministic parser runs a great potential risk 
that at some point a wrong choice is made and 
the right parse tree is missed. We demonstrate
empirically that the quality of the deterministic
parser can be maintained on a satisfactory level.
We first discuss deterministic parsing
theoretically and then proceed to discuss the
implemented parser.
¹ Formerly Harri Jäppinen
Current address: Ganesa Oy, It. Teatterik. 1 D 22, 00100 Helsinki, Finland. E-mail: harri@kielikone.fi
1 Strings and Governments 
1.1 Direct governments and governed strings
Let x be a node that has certain formal
properties. Let S = {x1, x2, x3, ...} be a
well-ordered set or a string of such nodes. (We do
not discuss the formal properties of nodes here.
Later on, when nodes are interpreted as word
forms, their formal properties will be
morpho-syntactic attributes.) Let R be a binary,
asymmetric, and antireflexive relation between
the nodes of S:
(1) $R = \{ \langle x_i, x_j \rangle \mid x_i, x_j \in S,\ x_i R x_j,\ \text{and}\ i \neq j \}$
We say that xi directly governs xj or is a direct 
governor or a regent of xj. Correspondingly, xj 
is directly governed by or a dependant of xi. 
Graphic representation indicates direct 
governments by arrows (Figure 1).
Rather than using just one direct government 
relation we admit several annotated binary 
relations, distinguished with integer subscripts. 
Let R = {R1, R2, R3, ...} be a set of such binary
relations. 
We stipulate the following tree constraint for 
the direct government: a directly governed node 
has a unique direct governor. 
[Figure 1: A governed string (nodes x1 to x5; direct governments drawn as arrows labeled R1, R2, R3, R3)]
We say that a node xi governs xj (i ≠ j) iff either xi
directly governs xj, or xi governs xk (k ≠ i, j) and xk
governs or directly governs xj. If all nodes of a
string except one are governed by the same
governor, we say that the string is totally
governed (by that common governor). Figure 1
shows a totally governed string that is governed
by node x1.
Due to the tree constraint, governed strings are 
topologically trees. We distinguish different 
kinds of ambiguities in strings. A string S is 
unambiguous with respect to a set of relations 
R if each node has only one possible direct 
governor and governing relation. S is locally 
ambiguous but globally unambiguous, if S has 
only one unique totally governed string but at 
least one node has more than one possible direct 
governor or governing relation. S is (globally) 
ambiguous, if there are more than one 
topologically different (or differently annotated) 
totally governed strings for it. 
We stipulate another topological constraint. The 
projectivity constraint states that if xi Rk xj, for
any i, j, k and i<j, there exists no Rp such that xm
Rp xn for any m and n such that m>j and i<n<j or
m<i and i<n<j. (The projectivity constraint
prohibits "crossing" direct governments.)
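The projectivity constraint can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: direct governments are given as (governor, dependant) pairs of node positions, the condition is applied symmetrically to both arc directions, and the function name `violates_projectivity` is hypothetical.

```python
from typing import Iterable, Tuple

Arc = Tuple[int, int]  # (governor position, dependant position); assumed encoding


def violates_projectivity(arcs: Iterable[Arc]) -> bool:
    """Return True if some direct government crosses another one.

    An arc between positions lo and hi is crossed when a node strictly
    between lo and hi is governed by a node outside the span [lo, hi].
    """
    arc_list = list(arcs)
    for gov, dep in arc_list:
        lo, hi = min(gov, dep), max(gov, dep)
        for m, n in arc_list:
            # n lies inside the span, its governor m lies outside it
            if lo < n < hi and (m < lo or m > hi):
                return True
    return False


# A nested (projective) structure and a crossing one, both hypothetical.
nested = [(1, 3), (3, 2), (3, 4), (4, 5)]
crossing = [(1, 3), (2, 4)]
print(violates_projectivity(nested), violates_projectivity(crossing))
```

The check is quadratic in the number of arcs, which is adequate for illustrating the definition; it is not meant as part of the parsing algorithm discussed later.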
1.2 Government Maps 
It is convenient to study governed strings using 
two-dimensional government maps (GM). A 
GM(S,R) is a matrix whose rows represent the 
nodes of a string S and the columns represent the 
relations of R. The ordering of the rows 
corresponds to the ordering of the nodes in S, 
while the ordering of the columns (relations) is 
arbitrary. The direct governor of a node is 
marked at the intersection of the governing 
relation and the governed node. For example, if 
R = {R1, R2, R3, R4} and S = {x1, x2, x3, x4, x5},
Table 1 shows the GM(S,R) of the governed
string in Figure 1. Formally, GM(S,R) ⊆ S × R ×
S. (Henceforth we often simply write GM rather
than GM(S,R).)
Node    Marked direct governor (in one of the columns R1-R4)
x1      -
x2      x3
x3      x1
x4      x3
x5      x4
Table 1: The GM of Figure 1
Two GM's are called equidimensional if they 
represent identical strings and identical sets of 
relations and the relations occupy the same 
columns in both maps. 
We borrow a few operations from the set theory. 
A direct government xi Rk xj belongs to a GM
(marked ∈) iff xi and xj are nodes in the GM and
xi Rk xj is marked in the GM. A government map
GM1 includes another equidimensional map
GM2 (GM2 ⊆ GM1) if all direct governors in
GM2 are also in GM1. GM1 properly includes
an equidimensional map GM2 (GM2 ⊂ GM1) if
GM2 ⊆ GM1 and GM1 ⊈ GM2. Any two
equidimensional maps are identical if they
include one another. We also admit unions (∪)
and intersections (∩) of two equidimensional
GM's in the obvious manner.
A GM may exhibit just those direct governors
which constitute a totally governed string, it may
show any subset of the direct governors of the
nodes, or it may exhibit all possible direct
governors of the nodes. We say that a resolved
GM (GM^r) is a map that shows only the direct
governors of a totally governed string. A
complete and unresolved GM (GM^cu) is a map
that indicates all possible direct governors of the
nodes. A (partially) unresolved GM (GM^u)
indicates some but not necessarily all direct
governors of the nodes. For each GM, GM^r ⊆
GM^cu and GM^u ⊆ GM^cu.
Let GM^r(S,R) and GM^cu(S,R) represent a
resolved and the complete and unresolved
equidimensional maps, respectively. If S is
unambiguous, GM^r = GM^cu. If S is locally
ambiguous but globally unambiguous, there
exists only one GM^r and GM^r ⊂ GM^cu. If S is
globally ambiguous, there exists more than one
different GM^r, and for each of them GM^r ⊂ GM^cu.
Finally, if there exists no GM^r such that GM^r ⊆
GM^cu, we say that the string is ungrammatical
(with respect to R).
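Since a GM is formally a subset of S × R × S, the set operations above map directly onto ordinary set operations. The Python sketch below is a minimal illustration, assuming a GM is stored as a set of (governed node, relation, governor) triples; the relation labels attached to each government and the helper name `ambiguous_rows` are assumptions made for the example.

```python
from typing import Set, Tuple

Entry = Tuple[str, str, str]  # (governed node, relation, direct governor); assumed encoding

# A resolved map and a complete, unresolved map for a five-node string.
# The relation labels are illustrative assumptions.
gm_r: Set[Entry] = {
    ("x2", "R1", "x3"),
    ("x3", "R2", "x1"),
    ("x4", "R3", "x3"),
    ("x5", "R3", "x4"),
}
gm_cu: Set[Entry] = gm_r | {("x4", "R3", "x5")}  # x4 has a second candidate governor


def ambiguous_rows(gm: Set[Entry]) -> Set[str]:
    """Return the nodes (rows) that have more than one marked entry."""
    entries = {}
    for governed, relation, governor in gm:
        entries.setdefault(governed, set()).add((relation, governor))
    return {node for node, marked in entries.items() if len(marked) > 1}


print(gm_r <= gm_cu)          # inclusion
print(gm_r < gm_cu)           # proper inclusion
print(gm_r & gm_cu == gm_r)   # intersection
print(ambiguous_rows(gm_cu))  # rows with multiple entries
```

A row with multiple entries in the complete map signals at least local ambiguity; whether the string is also globally ambiguous depends on how many resolved maps the complete map includes.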
Table 2 shows the GM^cu of a locally ambiguous
but globally unambiguous string. Node x4 cannot
be directly governed by both x3 and x5, hence the
string is locally ambiguous. Only the former
choice results in a totally governed string
(Figure 1 and Table 1).
Node    Marked direct governor(s) (in the columns R1-R4)
x1      -
x2      x3
x3      x1
x4      x3, x5
x5      x4
Table 2: Locally ambiguous but globally unambiguous GM^cu
Table 3 shows the GM^cu of a both locally and
globally ambiguous string. Figure 1 shows one
and Figure 2 shows another governed string
corresponding to this GM^cu. If a string is
ambiguous, at least one row has multiple entries
in the GM^cu.
Node    Marked direct governor(s) (in the columns R1-R4)
x1      -
x2      x3, x1
x3      x1
x4      x3
x5      x4
Table 3: Locally and globally ambiguous GM^cu
[Figure 2: Another governed string]
1.3 Deterministic parsing 
A GM^r carries all necessary information about
the structure of a governed string. If the process
of uncovering the structure of a string is called
parsing, a parsing process amounts to finding
the GM^r for a given string (or all resolved
maps if the string is globally ambiguous), and
the found GM^r represents the parse tree of the
string.
The nodes and relations in a GM generate an
abstract search space for governed strings.
Therefore, parsing can be viewed as a search for
the GM^r in the space generated by the string of
nodes and the set of available relations. The
process begins with an empty map and makes
progressively more and more direct governors
known. The process should end with a GM^u such
that GM^r ⊆ GM^u. If GM^r ⊂ GM^u there remains a
residual problem of finding GM^u', GM^u' ⊂ GM^u,
such that GM^u' = GM^r.
Let us assume that for each globally ambiguous 
string there is a single right parse tree, called the
preferred tree. We call a parsing process
deterministic, if it begins with an empty map
and marks direct governors in the map in such
an order that when the process ends GM^u = GM^r,
where GM^u is the explored map and GM^r
represents the parse tree or the preferred parse
tree if the string is globally ambiguous.
Theorem 1: Unambiguous strings can be parsed 
deterministically. 
A proof is trivial. Any algorithm which finds all 
possible direct governors of the nodes by 
iterating through all the relations and all the 
nodes creates the GM^cu by definition. And with
unambiguous strings, GM^r = GM^cu.
The following OS algorithm (for Open Search),
among others, parses unambiguous strings
deterministically. Let nR denote the number of
available relations and nN stand for the number
of nodes in an input string.
OS algorithm: 
1. Assign the available relations as columns in 
a GM in random order. 
2. Assign the nodes of an input string as rows 
in the GM in their precedence order. 
3. Mark each cell in the GM empty and each 
row open. 
4. For each column k (k = 1, ..., nR) test each xi
(i = 1, ..., nN) and each open xj (j = i-1, i-2, ...,
1, i+1, i+2, ..., nN) for xi R xj, where R is the
relation assigned to column k. Mark each
found direct governor xi in GM[j,k] and
mark row j closed.
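The four steps above can be sketched in Python as follows. This is a minimal illustration, not the implemented parser: nodes are assumed to be attribute dictionaries, each relation is assumed to be a (name, test) pair whose test takes a candidate governor and a candidate dependant, and the names `open_search` and `by_category` are hypothetical. The column order is simply the order of the `relations` list.

```python
from typing import Callable, Dict, List, Tuple

Node = Dict[str, str]
RelationTest = Callable[[Node, Node], bool]  # test(governor, dependant)


def open_search(nodes: List[Node],
                relations: List[Tuple[str, RelationTest]]) -> Dict[int, Tuple[int, str]]:
    """Map each closed row j to (governor index i, relation name)."""
    n = len(nodes)
    governor_of: Dict[int, Tuple[int, str]] = {}  # rows not in this dict are open
    for name, holds in relations:                 # step 4: each column k
        for i in range(n):                        # each candidate governor x_i
            # leftward neighbours first (i-1 ... 0), then rightward (i+1 ... n-1)
            candidates = list(range(i - 1, -1, -1)) + list(range(i + 1, n))
            for j in candidates:
                if j in governor_of:
                    continue                      # row j is already closed
                if holds(nodes[i], nodes[j]):
                    governor_of[j] = (i, name)    # mark GM[j, k] and close row j
    return governor_of


def by_category(gov_cat: str, dep_cat: str) -> RelationTest:
    """Build a toy relation test that looks only at syntactic categories."""
    return lambda gov, dep: gov["cat"] == gov_cat and dep["cat"] == dep_cat


toy_nodes = [{"cat": "Adjective"}, {"cat": "Noun"}, {"cat": "Verb"}]
toy_relations = [("AdjAttr", by_category("Noun", "Adjective")),
                 ("Subject", by_category("Verb", "Noun"))]
print(open_search(toy_nodes, toy_relations))
```

In the toy run the verb is the only row left open, so it plays the role of the common governor. The sketch does not enforce projectivity or any distance bound; it only mirrors the iteration order of step 4.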
Let us call the number of open nodes (plus 1) 
between a direct governor and the governed 
node at the moment of a test the distance of the 
relation test. 
Distance Hypothesis: It is possible to order
linguistic dependency relations as columns in a 
GM in such a way that the maximum distance 
remains within a fixed boundary when the OS 
algorithm parses natural language sentences. 
(We return to this hypothesis later on in this 
paper.) 
Theorem 2: If the distance hypothesis holds, 
unambiguous strings can be parsed in linear 
time. 
Let us assume that the distance hypothesis holds 
and let d stand for the maximum distance. The 
iteration statement in the OS algorithm is then 
limited as follows: 
(j = i-1, i-2, ..., i-d, i+1, i+2, ..., i+d)
Let C denote the cost of the most expensive
relation test. The OS algorithm consumes in the
worst case at most C * nR * nN * 2 * d = O(nN),
since C, nR and d are constants.
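The bound can be illustrated by counting relation tests directly. The Python sketch below assumes the worst case in which no row is ever closed, so that the distance d translates into at most d positions on either side of each candidate governor; the function name `count_bounded_tests` is hypothetical.

```python
def count_bounded_tests(n_nodes: int, n_relations: int, d: int) -> int:
    """Count relation tests when each governor looks at most d nodes each way."""
    tests = 0
    for _k in range(n_relations):              # each column
        for i in range(n_nodes):               # each candidate governor
            window = list(range(i - 1, i - d - 1, -1)) + list(range(i + 1, i + d + 1))
            # positions outside the string are never tested
            tests += sum(1 for j in window if 0 <= j < n_nodes)
    return tests


# The count never exceeds nR * nN * 2 * d, so it grows linearly in nN.
for n in (10, 20, 40, 80):
    count = count_bounded_tests(n, n_relations=32, d=3)
    assert count <= 32 * n * 2 * 3
    print(n, count)
```

Doubling the sentence length roughly doubles the count, whereas the unrestricted iteration of step 4 would grow quadratically in the number of nodes.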
Next we show that even ambiguous natural 
language sentences can be parsed 
deterministically in linear time if a certain 
additional condition holds. 
Best-First Conjecture: It is possible to order the 
linguistic relations as columns in a GM in such a 
way that (without violating the Distance 
Hypothesis) the OS algorithm produces for 
natural language sentences the right or the 
preferred GM r most of the time. 
Due to its heuristic flavor, we call the thus 
modified OS algorithm the BF algorithm. 
BF algorithm:
1. Assign the available linguistic relations as 
columns in a GM in such an order that both 
the Best-First Conjecture and the Distance 
Hypothesis hold. 
2.-4. (As steps 2.-4. of the OS algorithm.)
The enforcement of the Best-First Conjecture
brings a heuristic component into the algorithm,
and the algorithm no longer explores the search
space fully. Once the algorithm chooses a local
governor for a word over the alternative ones,
the alternative local governors are rejected for
good. Therefore, there is no guarantee that the
right parse tree will always be produced, hence
the phrase "most of the time".
Claim: The BF algorithm parses natural
language sentences deterministically in linear
time so that the right or the preferred parse trees
are produced most of the time.
This claim is an imprecise empirical statement
that can be supported only by empirical means.
That will be done next.
2 The Practical Parser
From now on we assume that strings of nodes
are natural language sentences and discuss a
fully implemented parser (DCParser) that parses
Finnish sentences. The DCParser differs from
the simple theoretical model described above,
but, as will be shown below, the differences do
not alter the theory.
2.1 Contexts
The formal part introduced binary relations as
context-free ordered pairs (1). Dependency
relations in the implemented parser use contexts.
Formally, they could be expressed as
context-sensitive ordered pairs as in (2), but the
DCParser uses a different rule syntax, as
discussed in 2.7.
(2) $R_i = \{ \langle [cx_l]\, x\, [cx_r],\ [cy_l]\, y\, [cy_r] \rangle \mid x\, R_i\, y \}$,
where x, y are morpho-syntactic representations
of the direct governor and the governed
word form, and $cx_l$, $cx_r$, $cy_l$, $cy_r$ are
morpho-syntactic representations of the left and
the right contexts of x and y, respectively.
The use of contexts in relations adds another
heuristic component to the BF-algorithm, and
one dependency relation may require quite a few
but a fixed number of such context-sensitive
definitions. Contexts do not alter, however, the
linear time behavior of Theorem 2. They only
increase the value of the constant C (the
computational cost of the most expensive
relation test).
2.2 Decomposition
The theoretical model assumes that sentences
are parsed in one pass. The DCParser divides
sentences into segments, using conjunctions and
delimiters as separators. The BF-algorithm is
applied to each segment separately, and the final
phase unites the structures built in those
segments by applying the algorithm again.
Decomposition greatly strengthens the Distance
Hypothesis, but it does not alter the linearity
proof, since the sum of linear elements is linear:
(3) $O(n_i) + O(n_j) + \dots + O(n_k) = O(n_N), \quad n_i, n_j, \dots, n_k \le n_N$
where $n_N$ is the number of the words in a
sentence.
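The decomposition step can be sketched as follows. This Python fragment is an illustration only: the separator inventory in `SEPARATORS` and the function name `split_segments` are assumptions, and separators are simply dropped from the segments here, which need not match how the DCParser attaches them.

```python
from typing import Iterable, List

# Assumed, illustrative separator inventory (delimiters and conjunctions).
SEPARATORS = frozenset({",", ";", ".", "and", "or", "but", "when"})


def split_segments(tokens: Iterable[str],
                   separators: frozenset = SEPARATORS) -> List[List[str]]:
    """Split a token sequence into segments at separators."""
    segments: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        if token.lower() in separators:
            if current:                # close the running segment
                segments.append(current)
            current = []
        else:
            current.append(token)
    if current:
        segments.append(current)
    return segments


tokens = "John laughed , and Mary smiled .".split()
segments = split_segments(tokens)
print(segments)
# The segment lengths sum to at most the sentence length, as in (3).
assert sum(len(segment) for segment in segments) <= len(tokens)
```

Because every word belongs to at most one segment, the per-segment costs add up to a quantity bounded by a constant times the sentence length.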
2.3 Homographic disambiguation 
The theoretical model did not discuss ambiguous 
nodes. In practice a word form can have several 
alternative morphotactic interpretations. The 
DCParser has a separate morphological analysis 
phase which produces all possible morphotactic 
interpretations for the word forms of input 
sentences. A separate preprocessing phase
explicitly disambiguates most of the lexical and
homographic ambiguities of Finnish word forms
using context-sensitive rules designed for the
purpose (Nykänen, 1986). The remaining
ambiguities are resolved implicitly by the
DCParser as follows. When an interpretation of
an ambiguous word form qualifies as a governed
node, the alternative interpretations are
rejected. This strategy implements yet another
heuristic component for the parser, but the
strategy does not alter the linearity argument
presented earlier.
2.4 The dependency relations 
The parser uses 32 different binary dependency 
relations for Finnish. The coordinating relations 
are discussed in 2.5. The most important other 
relations are listed in Table 4. The typical 
syntactic categories for the regents and for the 
dependants are also shown. Space does not 
allow a discussion of the individual relations. 
They are visualized in examples below. By
stipulation, the finite verb of the main clause is
the head of a grammatical sentence.
Relation name   Dependant        Regent
IntensAttr      Adverb           Adverb/Adj.
ModAttr         Adverb           Noun/Adj.
AdjAttr         Adjective        Noun
GenAttr         Noun             Noun
QuantAttr       Num/Adv./Pron.   Noun
NomAttr         Noun             Noun
MaterAttr       Noun             Noun
InfAttr         Verb             Noun
RelAttr         Verb             Noun
ClauseAttr      Verb             Noun
PostpComp       Noun             Postposition
PrepComp        Noun             Preposition
NegComp         Verb             NegVerb
AuxComp         Verb             Copula
Subject         Noun             Verb
Object          Noun             Verb
Adverbial       Noun/Adverb      Verb
Complement      Noun/Adjective   Copula
Connector       Delimiter        Verb/Noun
Separator       Delimiter        Verb
Head            Verb             none
Table 4: Common relations
2.5 Coordinations 
[Figure 3: Coordinations. Dependency diagrams for "John, Bill and Mary laughed" (arc labels: ConjPreComp, ConjPreComp, Subject) and "I saw John, Bill and Mary" (arc labels: Subject, Object, ConjPostComp, ConjPostComp)]
Coordinations are one of the main sources of
syntactic ambiguity in natural language
sentences. For us they also cause a notational
problem, since coordinations do not seem to be
prima facie binary relations. The DCParser treats
a coordination as two coexisting binary
relations. One word governs the coordinator,
which governs the other word. By stipulation,
the word among the coordinated words which is
closest to the regent becomes the head of the
coordination. For example, the coordinated
subject in the sentence John, Bill and Mary
laughed is ascending, while the coordinated
object in the sentence I saw John, Bill and Mary
is descending, as Figure 3 illustrates.
2.6 Subordinate clauses 
The DCParser treats finite subordinate clauses 
so that the subordinating conjunction serves as a 
linking word between the heads of the main and 
the subordinate clauses. The conjunction is in 
the relation in question, and the head of the 
subordinate clause is in the ConjPostComp- 
relation with the conjunction. Below there is a 
Finnish example sentence from the corpus, its 
rough word-for-word translation and the parse 
tree produced by the DCParser (4). This 
sentence exemplifies both subordinate clauses 
and coordinations. In this output mode the 
DCParser displays word forms as triplets: 
surface form, Relation, base form. Hierarchy is 
indicated using indentation: the regent of a given 
dependant is the first word below that is 
indented one step less. 
Riittää, kun puolueet ja niiden järjestöt
[It is enough], [when] [the parties and their organiz.]
velvoitetaan lainsäädännön avulla julkaisemaan
[are compelled] [using legislation] [to publish]
tarkasti tilinpäätökset, budjettinsa ja
[accurately] [financial statements, their budgets and
lahjoituksensa.
their donations].
(4)
|-,, Connector, _COMMA
|-puolueet, ConjPreComp, puolue
|-ja, CoordPreDep, ja
|-niiden, GenAttr, ne
|-järjestöt, Object, järjestö
|-lainsäädännön, GenAttr, lainsäädäntö
|-avulla, Adverbial, apu
|-lahjoituksensa, ConjPostComp, lah...
|-ja, CoordPostDep, ja
|-budjettinsa, ConjPostComp, budjetti
|-,, CoordPostDep, _COMMA
|-tilinpäätökset, Object, tilinpäätös
|-tarkasti, Adverbial, tarkasti
|-julkaisemaan, Adverbial, julkaista
|-velvoitetaan, ConjPostComp, velvoittaa
|-kun, Adverbial, kun
|-., Separator, _PERIOD
Riittää, Head, Riittää
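The indentation convention of this output mode (the regent of a dependant is the first word below it that is indented one step less) can be decoded mechanically. The Python sketch below is an illustration under assumptions: one indentation step is taken to be two spaces, the small English tree is invented for the example, and the function name `regents_from_display` is hypothetical.

```python
from typing import List, Optional, Tuple


def regents_from_display(lines: List[str],
                         indent: str = "  ") -> List[Tuple[str, str, Optional[str]]]:
    """Return (surface form, relation, regent surface form) for each line."""
    entries = []
    for line in lines:
        depth = 0
        while line.startswith(indent):         # count indentation steps
            depth += 1
            line = line[len(indent):]
        # split from the right so that a comma as surface form survives
        surface, relation, _base = [part.strip() for part in line.rsplit(",", 2)]
        entries.append((depth, surface, relation))
    result = []
    for index, (depth, surface, relation) in enumerate(entries):
        regent = None
        for later_depth, later_surface, _ in entries[index + 1:]:
            if later_depth == depth - 1:       # first word below, one step less
                regent = later_surface
                break
        result.append((surface, relation, regent))
    return result


display = [
    "    small, AdjAttr, small",
    "  dogs, Subject, dog",
    "  loudly, Adverbial, loudly",
    "bark, Head, bark",
]
for row in regents_from_display(display):
    print(row)
```

The head of the sentence is the only line without a regent, since no line below it is indented less.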
Another sentence from the corpus and its parse 
tree are as follows: 
Kysymys askarruttaa koko maailmaa nyt,
[The question] [puzzles] [the whole world] [now,]
kun Yhdysvaltain republikaanit
[when] [the republicans of the U.S.]
ovat pitäneet puoluekokouksensa
[have held] [their party congress]
ja nimenneet presidentti Bushin
[and] [nominated] [president Bush]
ja varapresidentti Dan Quaylen
[and] [vice-president Dan Quayle]
taisteluparikseen,
[as their fighting couple,]
kuten neljä vuotta sitten.
[like] [four years ago.]
(5)
|-Kysymys, Subject, Kysymys
|-koko, AdjAttr, koko
|-maailmaa, Object, maailma
|-,, Connector, _COMMA
|-Yhdysvaltain, GenAttr, Yhdysvallat
|-republikaanit, Subject, republikaani
|-puoluekokouksensa, Object, puoluekokous
|-presidentti, NomPreAttr, presidentti
| |-varapresidentti, NomPreAttr, ..
| |-Dan, NomPreAttr, Dan
| |-Quaylen, ConjPostComp, Quayle
|-ja, CoordPostDep, ja
|-Bushin, Object, Bush
| |-,, Connector, _COMMA
|-taisteluparikseen, Adverbial, taistelu...
| |-neljä, QuantAttr, neljä
| |-vuotta, PostpComp, vuosi
| |-sitten, ConjPostComp, sitten
|-kuten, Apposition, kuten
|-nimenneet, ConjPostComp, nimetä
|-ja, CoordPostDep, ja
|-pitäneet, AuxComp, pitää
|-ovat, ConjPostComp, olla
|-kun, ClauseAttr, kun
|-nyt, Adverbial, nyt
|-., Separator, _PERIOD
|-askarruttaa, Head, askarruttaa
2.7 The grammar 
In the DCParser word forms are represented as
objects of morpho-syntactic attributes. For
example, the word form järjestöt (organizations)
appears as [Form="järjestöt", Lex="järjestö",
Cat=Noun, Case=Nom, Number=PL]
For efficiency reasons binary relations are
expressed as active rules. The testing of a
relation, then, corresponds to the activation of
the respective rule or a set of alternative rules.
For example, a simplified rule for AdjAttr
(adjectival attribute of nouns) reads as:
(6)
-- Rule for adjectival attributes
Redo AdjAttr
Node Focus [ Cat=Noun, Cs:=Case, Nm:=Number ]
D1 := DepCand Left(1) [ Cat=Adjective,
Case=Cs, Number=Nm ]
-> MakeDep D1 [ Rel:=AdjAttr ]
A rule has two main parts: the condition part and 
the action part. The condition part searches and 
tests qualifying dependants and possible 
contextual words. A word qualifies in a test if its 
attribute object satisfies the description given in 
the rule. Variables can be used for passing 
attribute values. (":=" assigns a value; "=" tests a
value.) The action part binds and names
dependants and assigns values to attributes. The
rule above iteratively (Redo) binds immediately
preceding adjectives as attributes if they agree
in case and number with the head noun.
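The behavior of rule (6) can be mimicked in a few lines of Python. This is only a sketch of the rule's effect as described above, not the DCParser rule engine: word forms are assumed to be plain dictionaries, the keys `Regent` and `Rel` and the function name `bind_adj_attrs` are hypothetical, and "immediately preceding" is interpreted as the nearest still unbound word to the left.

```python
from typing import Dict, List

Word = Dict[str, object]


def bind_adj_attrs(words: List[Word], focus: int) -> List[int]:
    """Bind agreeing adjectives directly left of the focus noun; return their positions."""
    noun = words[focus]
    if noun["Cat"] != "Noun":
        return []
    bound = []
    i = focus - 1
    while i >= 0:                                   # "Redo": repeat while the rule applies
        candidate = words[i]
        agrees = (candidate["Cat"] == "Adjective"
                  and candidate["Case"] == noun["Case"]       # Case=Cs
                  and candidate["Number"] == noun["Number"])  # Number=Nm
        if candidate.get("Regent") is None and agrees:
            candidate["Regent"] = focus             # MakeDep D1
            candidate["Rel"] = "AdjAttr"            # Rel:=AdjAttr
            bound.append(i)
            i -= 1
        else:
            break                                   # the first non-qualifying word stops the rule
    return bound


words = [
    {"Form": "suuren", "Cat": "Adjective", "Case": "Gen", "Number": "SG"},
    {"Form": "pienet", "Cat": "Adjective", "Case": "Nom", "Number": "PL"},
    {"Form": "vanhat", "Cat": "Adjective", "Case": "Nom", "Number": "PL"},
    {"Form": "järjestöt", "Cat": "Noun", "Case": "Nom", "Number": "PL"},
]
print(bind_adj_attrs(words, focus=3))
```

The genitive singular adjective is left unbound because it does not agree with the nominative plural noun, which also ends the iteration.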
Rules are classified into generic rules (grammar 
proper) and lexical rules. Their expressive 
power is identical. The former are activated by 
syntactic categories. (6) visualizes a simple 
generic rule. Lexical rules are activated by 
specific lexemes. For example, (7) describes a 
part of a complex rule for the Finnish verb pitää.
(7)
LexBlock "pitaa"
-- pitää tehdä
Once pitaa 1
D1 := DepCand Right(4) [ Modal=Iinf ]
-> MakeDep D1 [ Rel:=Subject ]
Focus [ SubCat:=InfSubj ]
-- pitää jostakin
Once pitaa 10
D1 := DepCand Right(4) [ Cat=Noun+Proper+Pronoun,
Case=El ]
Not Node From D1 Right(1) [ Modal=Ipartic+IIpartic ]
-> MakeDep D1 [ Rel:=Adverbial ]
Focus [ SubCat:=Intr ]
Pitää has several senses and subcategories in
Finnish. (7) shows two of them. The first
alternative treats the verb as a modal verb as in
Minun pitää mennä saunaan (I must go to the
sauna). (In our linguistic analysis we treat the
infinitive mennä (to go) as the subject of the
modal verb pitää and the genitive minun (I) as
the subject of the infinitive.) The second
alternative handles the idiomatic usage Minä
pidän hänestä (I like her) where a surface elative
adverbial represents a deep semantic object of
pitää. The rule binds an elative as adverbial, but
does not bind it if the elative is followed by a
participle as in Minä pidän hänestä lähtevästä
tuoksusta (I like the fragrance coming from
her).
The grammar (Arnola, 1998) consists of about
950 generic rules and of about 12 500 lexical 
rules. An algorithm, which implements the Best- 
First strategy, controls the activation of the rules. 
3 Empirical Results
3.1 Benchmark test suite 
The parser has been under development for 
years. It is an integral part of a commercial 
machine translation system called TranSmart®. 
A benchmark test suite of correctly parsed 
sentences (source sentences and their correct 
parse trees) has been accumulated during this 
period. Only sentences that have revealed 
grammatical errors in the parser have been 
added to the test suite after the errors were 
corrected. Otherwise the test suite sentences have
been randomly selected. The test suite sentences
are periodically parsed to guarantee monotonic
improvement of the grammar.
[Figure 4: Distribution of the sentence lengths of the benchmark test suite]
As of this writing, the benchmark test suite 
comprises over 3000 sentences. The distribution 
of the sentence lengths (including delimiters) is 
shown in Figure 4. The average sentence length 
is 12.1 words. 
3.2 Linearity argument 
We used the benchmark test suite sentences to 
test the linearity claim. Figure 5 shows the 
distribution of the parsing times in seconds. The 
processor is an old Intel 486, 66 MHz. A 150
MHz Pentium processor parses about 400
sentences/minute of running text.
[Figure 5: Distribution of the parsing times of the test suite sentences]
[Figure 6: Average parsing times for different sentence lengths]
Figure 6 plots the average parsing times for each 
sentence length. Sentences whose length is 
between 5 and 20 words form statistically 
meaningful sets. Their average parsing times 
form a clear linear function. Longer sentences 
do not support a contrary view. 
3.3 Quality 
It remains to discuss the quality of the parser. 
We use the following strict criterion for the
correctness of a parse tree. A sentence is parsed
correctly if the sentence is grammatical and the
produced dependency structure completely
complies with the structure a competent human
judge would assign to it. Otherwise the parse
tree is judged incorrect. Hence, a single, local
structural error in an otherwise correct parse tree
disqualifies the structure. If a sentence is
globally ambiguous but it is clear for a human 
reader which structure is meant, the structure is 
judged correct only if it is in agreement with the 
human decision. If a human reader cannot make 
the right choice for an ambiguous sentence 
without textual context, the structure is deemed 
correct if it is one of the possible correct 
structures. 
Presently the DCParser is fully developed in the
sense that it is in practical use in commercial
machine translation systems. However, the
tuning of the parser still continues. The parser
has been subjected to tens of thousands of
genuine unedited sentences from different
sources over the years. Each parse tree has been
carefully studied and all indicated errors or gaps
that could be systematically corrected were
corrected in the grammar and in the lexicons.
About once a week the benchmark test suite was
processed and possible errors found in the test
suite were corrected.
Occasionally (about once in a month or two) a
fresh piece of text was randomly selected. The
total number of sentences in the text and the
number of sentences parsed correctly right away
were recorded. The incorrectly parsed sentences
were classified into three classes: the ones
parsed correctly after (only) lexical corrections,
the ones parsed correctly after grammatical
corrections (and possible lexical corrections),
and the ones whose parsing errors could not be
corrected in systematic fashion. These last errors
exhibit a fundamental drawback of the Best-First
strategy. Table 5 shows the data of these test
samples. Each column presents both absolute
and relative numbers: absolute/percentage%.
Text   No. of   Parsed      Req. lexical   Req. gramm.   Fatally
       sent.    correctly   correct.       correct.      incorrect
1      375      294/78%     50/13%         19/5%         12/3%
2      196      148/76%     34/17%         8/4%          8/4%
3      196      162/83%     18/9%          10/5%         6/3%
4      181      148/82%     19/10%         6/3%          8/4%
5      253      207/82%     26/10%         11/4%         9/4%
6      380      331/87%     38/10%         4/1%          7/2%
7      216      174/81%     27/13%         8/4%          7/3%
8      297      249/84%     33/11%         8/3%          7/2%
9      387      334/86%     34/9%          11/3%         8/2%
10     196      166/85%     16/8%          8/4%          6/3%
11     267      224/84%     25/9%          8/3%          10/4%
12     118      97/82%      16/14%         2/2%          3/3%
13     2680     2271/85%    252/9%         83/3%         74/3%
14     391      337/86%     36/9%          11/3%         7/2%
Table 5: Parsing quality of the test samples
Figure 7 shows the percentage numbers of each
column in a graphic form. Lines are fitted to the
data to indicate possible tendencies of the series.
[Figure 7: Parsing quality of the 14 last samples (percentage per sample for the series parsed correctly, lexical corrections, grammar corrections and fatally incorrect, with a fitted line)]
Table 5 and Figure 7 show that the parser seems
to embody a stable 2-4% error ratio due to
fundamental problems in the Best-First strategy.
Approximately the same number of sentences
(2-5%) have revealed grammatical deficiencies
in the parser. This figure may have a slow,
although not clear, declining trend. 9-17% of the
sentences have revealed lexical deficiencies, and
this figure seems to have a slow declining trend.
76-87% of the sentences were parsed correctly
right away, and this figure seems to show a clear
if slow upward trend. (The test samples cover
almost two years of rather intense tuning.)
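The share of sentences parsed correctly right away can be summarized in two ways, which the following Python sketch contrasts for the last five rows of Table 5. The counts are copied from the table; the variable and function names are hypothetical, and the choice between averaging per-sample rates and pooling all sentences is an assumption about how such a summary might be computed.

```python
from typing import Dict, Tuple

# (number of sentences, parsed correctly right away), last five rows of Table 5
last_five: Dict[str, Tuple[int, int]] = {
    "a": (196, 166),
    "b": (267, 224),
    "c": (118, 97),
    "d": (2680, 2271),
    "e": (391, 337),
}


def mean_of_rates(samples: Dict[str, Tuple[int, int]]) -> float:
    """Average the per-sample correctness rates (each sample weighs the same)."""
    return sum(correct / total for total, correct in samples.values()) / len(samples)


def pooled_rate(samples: Dict[str, Tuple[int, int]]) -> float:
    """Divide all correct parses by all sentences (each sentence weighs the same)."""
    total = sum(total for total, _ in samples.values())
    correct = sum(correct for _, correct in samples.values())
    return correct / total


print(f"mean of rates: {100 * mean_of_rates(last_five):.1f}%")
print(f"pooled rate:   {100 * pooled_rate(last_five):.1f}%")
```

Both summaries fall between 84% and 85% for these rows; they would diverge more if a very large sample had a markedly different rate from the small ones.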
Conclusion 
In this paper we have argued that it is possible to 
parse binary dependency structures of natural 
language sentences deterministically and in 
linear time, and to keep parsing quality within 
acceptable limits, if syntactic heuristics are
applied appropriately. A possibility for linear 
parsing has been proved theoretically and 
demonstrated empirically. The quality issue was 
discussed using empirical data. Determinism 
was accomplished with a Best-First search 
algorithm which implements syntactic heuristics 
in three ways: 1) in a permanent ordering of the 
testing of dependency relations, 2) in the 
implicit disambiguation of homographic word 
form interpretations, and 3) in the contexts of 
dependency relation rules. 
Linear behavior is strongly supported by the 
empirical data. It is difficult to be precise about 
the quality issue. Empirical data shows that the 
upper limit of the quality of this deterministic 
strategy is 96-98%. The inherent error rate is due 
to the use of heuristics. Nondeterministic parsers
do not have such theoretical barriers. But this
inherent error ratio should be contrasted with the
fact that a deterministic parser produces the right
parse tree, while a nondeterministic parser
usually produces accurately only a forest of
candidate parse trees.
At the moment of this writing this deterministic 
parser seems to have reached about 85% 
correctness rate (the average of the last five 
samples). Current errors are mainly lexical 
errors or gaps (about 9%) which usually can be 
easily corrected but the corrections improve the 
quality only slightly. Some 3% of the current 
errors are errors and gaps in the grammar. One 
should be cautious, however, of giving any 
precise numbers for parsing quality, since our 
experience shows that quality numbers vary
markedly from one text to another. 
An interactive demonstration of the parser is 
available to the public for testing purposes at 
http://www'kielik°ne'fddcparser'fi'dem°" and 
the machine translation system (from Finnish 
into English) at http://www.kielikone.fi/fealcee. 
Acknowledgements 
My thanks go to the whole personnel of 
Kielikone Ltd. and, in particular, to Kaarina
Hyvönen, Jukka-Pekka Juntunen, the late Tim
Linnanvirta, and Asko Nykänen, who have
contributed to the paper. I also want to thank the
anonymous referees of this article. The Sitra
Foundation and The Technology Development
Centre have financially supported the work
reported in this article. 

References 
Arnola, H. (1998) The functional dependency
structure of Finnish. (manuscript)
Hays, D. (1964) Dependency theory: a formalism and
some observations. Language, 40, pp. 511-525.
Hellwig, P. (1986) Dependency unification grammar.
Proc. COLING-86, Bonn.
Hudson, R. (1976) Arguments for a
Non-transformational Grammar. The University of
Chicago Press.
Jäppinen, H., Lehtola, A., and Valkonen, K. (1986)
Functional structures for parsing dependency
constraints. Proc. COLING-86, Bonn, pp. 461-463.
Mel'chuk, I. (1988) Dependency Syntax: Theory and
Practice. State University of New York Press.
Nelimarkka, E., Jäppinen, H., and Lehtola, A. (1984)
Two-way automata and dependency grammar: a
parsing method for inflectional free word-order
languages. Proc. COLING-84 and 22nd ACL
Meeting, Stanford, pp. 389-392.
Nykänen, A. (1996) Design and Implementation of
an Environment for Parsing Finnish. M.Sc. (Eng.)
Thesis, Helsinki University of Technology,
Department of Computer Science (in Finnish)
Robinson, J. (1970) Dependency structure and
transformational rules. Language, 2/46.
Schubert, K. (1986) Linguistic and extra-linguistic
knowledge. Computers and Translation, 1, pp.
125-152.
Starosta, S. (1988) The Case for Lexicase. Pinter
Publishers.
Valkonen, K., Jäppinen, H., and Lehtola, A. (1987)
Blackboard-based dependency parsing. Proc.
IJCAI-87, Milan, pp. 700-702.
