A Deterministic Word Dependency Analyzer
Enhanced With Preference Learning
Hideki Isozaki and Hideto Kazawa and Tsutomu Hirao
NTT Communication Science Laboratories
NTT Corporation
2-4 Hikaridai, Seikacho, Sourakugun, Kyoto, 619-0237 Japan
{isozaki,kazawa,hirao}@cslab.kecl.ntt.co.jp
Abstract
Word dependency is important in parsing tech-
nology. Some applications such as Informa-
tion Extraction from biological documents ben-
efit from word dependency analysis even with-
out phrase labels. Therefore, we expect that an
accurate dependency analyzer trainable without
phrase labels will be useful. Although such
an English word dependency analyzer was pro-
posed by Yamada and Matsumoto, its accu-
racy is lower than state-of-the-art phrase struc-
ture parsers because of the lack of top-down in-
formation given by phrase labels. This paper
shows that the dependency analyzer can be im-
proved by introducing a Root-Node Finder and
a Prepositional-Phrase Attachment Resolver.
Experimental results show that these modules
based on Preference Learning give better scores
than Collins’ Model 3 parser for these subprob-
lems. We expect that this method is also applicable
to phrase structure parsers.
1 Introduction
1.1 Dependency Analysis
Word dependency is important in parsing technol-
ogy. Figure 1 shows a word dependency tree. Eis-
ner (1996) proposed probabilistic models of depen-
dency parsing. Collins (1999) used dependency
analysis for phrase structure parsing. It is also stud-
ied by other researchers (Sleator and Temperley,
1991; Hockenmaier and Steedman, 2002). How-
ever, statistical dependency analysis of English
sentences without phrase labels has not been studied
much, while phrase structure parsing is studied
intensively. Recent studies show that Information
Extraction (IE) and Question Answering (QA) benefit
from word dependency analysis without phrase labels
(Suzuki et al., 2003; Sudo et al., 2003).
Recently, Yamada and Matsumoto (2003) pro-
posed a trainable English word dependency ana-
lyzer based on Support Vector Machines (SVM).
They did not use phrase labels, with the annotation
of documents in expert domains in mind. SVM
(Vapnik, 1995) has shown good performance in
different tasks of Natural Language Processing (Kudo
and Matsumoto, 2001; Isozaki and Kazawa, 2002).

Figure 1: A word dependency tree for "He saw a girl with a telescope."
Most machine learning methods do not work well
when the number of given features (dimensionality)
is large, but SVM is relatively robust. In Natural
Language Processing, we use tens of thousands of
words as features. Therefore, SVM often gives good
performance.
However, the accuracy of Yamada’s analyzer is
lower than state-of-the-art phrase structure parsers
such as Charniak’s Maximum-Entropy-Inspired
Parser (MEIP) (Charniak, 2000) and Collins’ Model
3 parser. One reason is the lack of top-down infor-
mation that is available in phrase structure parsers.
In this paper, we show that the accuracy of the
word dependency parser can be improved by adding
a base-NP chunker, a Root-Node Finder, and a
Prepositional Phrase (PP) Attachment Resolver. We
introduce the base-NP chunker because base NPs
are important components of a sentence and can be
easily annotated. Since most words are contained
in a base NP or are adjacent to a base NP, we ex-
pect that the introduction of base NPs will improve
accuracy.
We introduce the Root-Node Finder because Ya-
mada’s root accuracy is not very good. Each sen-
tence has a root node (word) that does not modify
any other words and is modified by all other words
directly or indirectly. Here, the root accuracy is de-
fined as follows.
Root Accuracy (RA) =
#correct root nodes / #sentences (= 2,416)
We think that the root node is also useful for depen-
dency analysis because it gives global information
to each word in the sentence.
Root node finding can be solved by various ma-
chine learning methods. If we use classifiers, how-
ever, two or more words in a sentence can be classi-
fied as root nodes, and sometimes none of the words
in a sentence is classified as a root node. Practically,
this problem is solved by getting a kind of confi-
dence measure from the classifier. As for SVM,
f(x) defined below is used as a confidence measure.
However, f(x) is not necessarily a good confidence
measure.
Therefore, we use Preference Learning proposed
by Herbrich et al. (1998) and extended by Joachims
(2002). In this framework, a learning system is
trained with samples such as “A is preferable to
B” and “C is preferable to D.” Then, the system
generalizes the preference relation, and determines
whether “X is preferable to Y” for unseen X and
Y. This framework seems better suited than SVM for
selecting the best candidates.
On the other hand, it is well known that attach-
ment ambiguity of PP is a major problem in parsing.
Therefore, we introduce a PP-Attachment Resolver.
The next sentence has two interpretations.
He saw a girl with a telescope.
1) The preposition ‘with’ modifies ‘saw.’ That is, he
has the telescope. 2) ‘With’ modifies ‘girl.’ That is,
she has the telescope.
Suppose 1) is the correct interpretation. Then,
“with modifies saw” is preferred to “with mod-
ifies girl.” Therefore, we can use Preference
Learning again.
Theoretically, it is possible to build a new De-
pendency Analyzer by fully exploiting Preference
Learning, but we do not because its training takes
too long.
1.2 SVM and Preference Learning
Preference Learning is a simple modification of
SVM. Each training example for SVM is a pair
(y_i, x_i), where x_i is a vector, y_i = +1 means that
x_i is a positive example, and y_i = −1 means that x_i
is a negative example. SVM classifies a given test
vector x by using a decision function

    f(x) = w_f · φ(x) + b = Σ_{i=1}^{ℓ} y_i α_i K(x, x_i) + b,

where {α_i} and b are constants and ℓ is the number
of training examples. K(x_i, x_j) = φ(x_i) · φ(x_j) is
a predefined kernel function, and φ(x) is a function
that maps a vector x into a higher-dimensional space.
Training of SVM corresponds to the following
quadratic maximization (Cristianini and Shawe-Taylor,
2000):

    W(α) = Σ_{i=1}^{ℓ} α_i − (1/2) Σ_{i,j=1}^{ℓ} α_i α_j y_i y_j K(x_i, x_j),

where 0 ≤ α_i ≤ C and Σ_{i=1}^{ℓ} α_i y_i = 0. C is a soft
margin parameter that penalizes misclassification.
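The decision function f(x) above can be sketched directly from its definition. This is a minimal illustration, not a trained model: the support vectors, labels, and multipliers below are toy placeholders, and the quadratic kernel is the one used later in our experiments.

```python
def dot(a, b):
    """Inner product of two plain-list vectors."""
    return sum(x * y for x, y in zip(a, b))

def quadratic_kernel(a, b):
    """K(a, b) = (a . b + 1)^2, the kernel used for the Root-Node Finder."""
    return (dot(a, b) + 1.0) ** 2

def svm_decision(x, support_vectors, labels, alphas, bias):
    """f(x) = sum_i y_i alpha_i K(x, x_i) + b."""
    return sum(y * a * quadratic_kernel(x, sv)
               for sv, y, a in zip(support_vectors, labels, alphas)) + bias

# Toy usage with made-up support vectors and multipliers.
svs = [[1.0, 0.0], [0.0, 1.0]]
ys = [+1, -1]
alphas = [0.5, 0.5]
print(svm_decision([1.0, 1.0], svs, ys, alphas, bias=0.0))
```

The sign of f(x) gives the class; its magnitude is the confidence-like value mentioned in Section 1.1.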
On the other hand, each training example
for Preference Learning is given by a triplet
(y_i, x_{i:1}, x_{i:2}), where x_{i:1} and x_{i:2} are vectors. We
use x_{i:*} to represent the pair (x_{i:1}, x_{i:2}). y_i = +1
means that x_{i:1} is preferable to x_{i:2}: we can regard
their difference φ(x_{i:1}) − φ(x_{i:2}) as a positive
example and φ(x_{i:2}) − φ(x_{i:1}) as a negative example.
Symmetrically, y_i = −1 means that x_{i:2} is preferable
to x_{i:1}.

The preference of a vector x is given by

    g(x) = w_g · φ(x) = Σ_{i=1}^{ℓ} y_i α_i (K(x_{i:1}, x) − K(x_{i:2}, x)).

If g(x) > g(x′) holds, x is preferable to x′. Since
Preference Learning uses the difference φ(x_{i:1}) −
φ(x_{i:2}) instead of SVM's φ(x_i), it corresponds to
the following maximization:

    W(α) = Σ_{i=1}^{ℓ} α_i − (1/2) Σ_{i,j=1}^{ℓ} α_i α_j y_i y_j K(x_{i:*}, x_{j:*}),

where 0 ≤ α_i ≤ C and

    K(x_{i:*}, x_{j:*}) = K(x_{i:1}, x_{j:1}) − K(x_{i:1}, x_{j:2}) − K(x_{i:2}, x_{j:1}) + K(x_{i:2}, x_{j:2}).

The linear constraint Σ_{i=1}^{ℓ} α_i y_i = 0 used for SVM
is not applied to Preference Learning: SVM requires this
constraint for the optimal b, but there is no b in g(x).
Although SVMlight (Joachims, 1999) provides an
implementation of Preference Learning, we use our
own implementation because the current SVMlight
implementation does not support non-linear kernels
and our implementation is more efficient.
Herbrich’s Support Vector Ordinal Regression
(Herbrich et al., 2000) is based on Preference Learn-
ing, but it solves an ordered multiclass problem.
Preference Learning does not assume any classes.
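The preference kernel K(x_{i:*}, x_{j:*}) defined above can be computed from any base kernel. A minimal sketch, using a linear base kernel for illustration (the function names are ours, not from an existing library):

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def base_kernel(a, b):
    # Linear base kernel for illustration; the paper uses a quadratic one.
    return dot(a, b)

def preference_kernel(pair_i, pair_j):
    """K(x_i:*, x_j:*) = K(xi1,xj1) - K(xi1,xj2) - K(xi2,xj1) + K(xi2,xj2)."""
    (xi1, xi2), (xj1, xj2) = pair_i, pair_j
    return (base_kernel(xi1, xj1) - base_kernel(xi1, xj2)
            - base_kernel(xi2, xj1) + base_kernel(xi2, xj2))

# With a linear base kernel this equals the dot product of the two
# difference vectors phi(xi1)-phi(xi2) and phi(xj1)-phi(xj2).
pi = ([2.0, 1.0], [1.0, 0.0])   # difference [1, 1]
pj = ([1.0, 3.0], [0.0, 1.0])   # difference [1, 2]
print(preference_kernel(pi, pj))
```

This identity is what lets the preference dual reuse a standard SVM optimizer with K(x_{i:*}, x_{j:*}) in place of K(x_i, x_j).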
2 Methodology
Instead of building a word dependency corpus from
scratch, we use the standard data set for comparison.
Figure 2: Module layers in the system. From bottom to
top: (POS Tagger), Base NP Chunker, Root-Node Finder,
and the Dependency Analyzer with the PP-Attachment
Resolver. The chunker and the Dependency Analyzer are
based on SVM; the Root-Node Finder and the PP-Attachment
Resolver are based on Preference Learning.
That is, we use Penn Treebank’s Wall Street Journal
data (Marcus et al., 1993). Sections 02 through 21
are used as training data (about 40,000 sentences)
and section 23 is used as test data (2,416 sentences).
We converted them to word dependency data by us-
ing Collins’ head rules (Collins, 1999).
The proposed method uses the following proce-
dures.
- A base NP chunker: We implemented an
SVM-based base NP chunker, which is a simplified
version of Kudo's method (Kudo and Matsumoto,
2001). We use the 'one vs. all others' backward
parsing method based on the 'IOB2' chunking
scheme. By the chunking, each word is tagged as
  - B: beginning of a base NP,
  - I: other element of a base NP,
  - O: otherwise.
  Please see Kudo's paper for more details.

- A Root-Node Finder (RNF): We will describe
this later.

- A Dependency Analyzer: It works just like
Yamada's Dependency Analyzer.

- A PP-Attachment Resolver (PPAR): This resolver
improves the dependency accuracy of prepositions
whose part-of-speech tags are IN or TO.
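The IOB2 tags produced by the base NP chunker can be decoded into base NP chunks. A minimal sketch; the decoding function and the example tag sequence are ours, not part of the chunker:

```python
def extract_base_nps(words, tags):
    """Return base NPs as word lists from parallel word/IOB2-tag lists."""
    chunks, current = [], None
    for word, tag in zip(words, tags):
        if tag == "B":                            # beginning of a base NP
            if current is not None:
                chunks.append(current)
            current = [word]
        elif tag == "I" and current is not None:  # inside the current NP
            current.append(word)
        else:                                     # O: outside any base NP
            if current is not None:
                chunks.append(current)
            current = None
    if current is not None:
        chunks.append(current)
    return chunks

words = "He saw a girl with a telescope".split()
tags = ["B", "O", "B", "I", "O", "B", "I"]
print(extract_base_nps(words, tags))
# -> [['He'], ['a', 'girl'], ['a', 'telescope']]
```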
The above procedures require a part-of-speech
tagger. Here, we extract part-of-speech tags from
the Collins parser’s output (Collins, 1997) for sec-
tion 23 instead of reinventing a tagger. According
to the document, it is the output of Ratnaparkhi’s
tagger (Ratnaparkhi, 1996). Figure 2 shows the ar-
chitecture of the system. PPAR’s output is used to
rewrite the output of the Dependency Analyzer.
2.1 Finding root nodes
When we use SVM, we regard root-node finding as
a classification task: Root nodes are positive exam-
ples and other words are negative examples.
For this classification, each word w_i in a tagged
sentence T = (w_1/p_1, ..., w_i/p_i, ..., w_N/p_N) is
characterized by a set of features. Since the given
POS tags are sometimes too specific, we introduce
a rough part-of-speech tag q_i defined as follows.

- q = N if p ∈ {NN, NNP, NNS, NNPS, PRP, PRP$, POS},
- q = V if p ∈ {VBD, VB, VBZ, VBP, VBN},
- q = J if p ∈ {JJ, JJR, JJS}.
Then, each word is characterized by the following
features and is encoded by a set of boolean variables.

- The word itself w_i, its POS tags p_i and q_i, and
its base NP tag b_i ∈ {B, I, O}. We introduce boolean
variables such as "current word is John" and "current
rough POS is J" for each of these features.

- The previous word w_{i−1} and its tags p_{i−1}, q_{i−1},
and b_{i−1}.

- The next word w_{i+1} and its tags p_{i+1}, q_{i+1}, and
b_{i+1}.

- The set of left words {w_0, ..., w_{i−1}} and their
tags {p_0, ..., p_{i−1}}, {q_0, ..., q_{i−1}}, and
{b_0, ..., b_{i−1}}. We use boolean variables such as
"one of the left words is Mary."

- The set of right words {w_{i+1}, ..., w_N} and their
POS tags {p_{i+1}, ..., p_N} and {q_{i+1}, ..., q_N}.

- Whether the word is the first word or not.

We also add the following boolean features to get
more contextual information.

- Existence of verbs or auxiliary verbs (MD) in
the sentence.

- The number of words between w_i and the nearest
left comma. We use boolean variables such as "nearest
left comma is two words away."

- The number of words between w_i and the nearest
right comma.
Now, we can encode training data by using these
boolean features. Each sentence is converted to a
set of pairs {(y_i, x_i)}, where y_i is +1 when x_i
corresponds to the root node and y_i is −1 otherwise.
For Preference Learning, we make a set of triplets
{(y_i, x_{i:1}, x_{i:2})}, where y_i is always +1, x_{i:1}
corresponds to the root node, and x_{i:2} corresponds
to a non-root word in the same sentence. Such a triplet
means that x_{i:1} is preferable to x_{i:2} as a root node.
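The triplet construction described above can be sketched as follows. `featurize` is a placeholder for the boolean feature extraction, which is elided here:

```python
def featurize(word, position, sentence):
    # Placeholder: real features include the word, POS tags,
    # base NP tags, left/right context, commas, and so on.
    return (word, position)

def root_triplets(sentence, root_index):
    """Yield (+1, root_features, non_root_features) for one sentence."""
    root_x = featurize(sentence[root_index], root_index, sentence)
    for i, w in enumerate(sentence):
        if i != root_index:
            yield (+1, root_x, featurize(w, i, sentence))

sent = ["He", "saw", "a", "girl"]
triplets = list(root_triplets(sent, root_index=1))  # 'saw' is the root
print(len(triplets))  # one triplet per non-root word
```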
2.2 Dependency analysis
Our Dependency Analyzer is similar to Ya-
mada’s analyzer (Yamada and Matsumoto,
2003). While scanning a tagged sentence
T = (w_1/p_1, ..., w_n/p_n) backward from the end of
the sentence, each word w_i is classified into one of
three categories: Left, Right, and Shift.¹

- Right: Right means that w_i directly modifies the
right word w_{i+1} and that no word in T modifies w_i.
If w_i is classified as Right, the analyzer removes
w_i from T, and w_i is registered as a left child of
w_{i+1}.

- Left: Left means that w_i directly modifies the
left word w_{i−1} and that no word in T modifies w_i.
If w_i is classified as Left, the analyzer removes w_i
from T, and w_i is registered as a right child of
w_{i−1}.

- Shift: Shift means that w_i is not next to its
modificand or is modified by another word in T. If
w_i is classified as Shift, the analyzer does nothing
for w_i and moves to the left word w_{i−1}.
This process is repeated until T is reduced to a
single word (the root node). Since this is a three-class
problem, we use the 'one vs. rest' method. First, we
train an SVM classifier for each class. Then, for each
word in T, we compare their values f_Left(x),
f_Right(x), and f_Shift(x). If f_Left(x) is the
largest, the word is classified as Left.
However, Yamada’s algorithm stops when all
words in T are classified as Shift, even when T has
two or more words. In such cases, the analyzer can-
not generate complete dependency trees.
Here, we resolve this problem by reclassifying a
word in T as Left or Right. This word is selected in
terms of the differences between SVM outputs:

    Δ_Left(x) = f_Shift(x) − f_Left(x),
    Δ_Right(x) = f_Shift(x) − f_Right(x).

These values are non-negative because f_Shift(x)
was selected. For instance, Δ_Left(x) ≈ 0 means that
f_Left(x) is almost equal to f_Shift(x). If Δ_Left(x_k)
gives the smallest value of these differences, the
word corresponding to x_k is reclassified as Left. If
Δ_Right(x_k) gives the smallest value, the word
corresponding to x_k is reclassified as Right. Then, we
can resume the analysis.

¹ Yamada used a two-word window, but we use a one-word
window for simplicity.
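The backward Left/Right/Shift scan and the deadlock-breaking step can be sketched as follows. `classify` stands in for the trained SVM classifiers; here it is a toy right-branching rule so the sketch runs, and words are assumed distinct:

```python
def classify(i, words):
    # Toy scores standing in for (f_Left, f_Right, f_Shift):
    # every non-final word "modifies the right word" (Right).
    if i < len(words) - 1:
        return {"Left": 0.0, "Right": 1.0, "Shift": 0.5}
    return {"Left": 0.0, "Right": 0.0, "Shift": 1.0}

def parse(words):
    """Return {child: parent} by repeatedly scanning backward and reducing."""
    heads = {}
    items = list(words)
    while len(items) > 1:
        reduced = False
        for i in range(len(items) - 1, -1, -1):   # scan backward
            scores = classify(i, items)
            action = max(scores, key=scores.get)
            if action == "Right" and i + 1 < len(items):
                heads[items[i]] = items[i + 1]    # w_i becomes a left child
                del items[i]
                reduced = True
            elif action == "Left" and i - 1 >= 0:
                heads[items[i]] = items[i - 1]    # w_i becomes a right child
                del items[i]
                reduced = True
        if not reduced:
            # Deadlock: every word was Shift. In the paper, the word with the
            # smallest margin f_Shift - f_Left or f_Shift - f_Right would be
            # reclassified here; the toy classifier above never deadlocks.
            break
    return heads

print(parse(["a", "girl"]))
```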
We use the following basic features for each word
in a sentence.

- The word itself w_i and its tags p_i, q_i, and b_i.

- Whether w_i is on the left of the root node or on
the right (or at the root node). The root node is
determined by the Root-Node Finder.

- Whether w_i is inside a quotation.

- Whether w_i is inside a pair of parentheses.

- w_i's left children {w_{i1}, ..., w_{ik}}, which were
removed by the Dependency Analyzer beforehand because
they were classified as 'Right.' We use boolean
variables such as "one of the left children is Mary."
Symmetrically, w_i's right children are also used.
However, the above features cover only nearsighted
information. If w_i is next to a very long base NP or
a sequence of base NPs, w_i cannot get information
beyond the NPs. Therefore, we add the following
features.

- L_i, R_i: L_i is available when w_i immediately
follows a base NP sequence, and is the word before
that sequence. That is, the sentence looks like:

    ... L_i <a base NP> w_i ...

  R_i is defined symmetrically.

The following features of neighbors are also used
as w_i's features.

- Left words w_{i−3}, ..., w_{i−1} and their basic
features.

- Right words w_{i+1}, ..., w_{i+3} and their basic
features.

- The analyzer's outputs (Left/Right/Shift) for
w_{i+1}, ..., w_{i+3}. (The analyzer runs backward from
the end of T.)
If we train SVM by using the whole data at once,
training will take too long. Therefore, we split the
data into six groups: nouns, verbs, adjectives,
prepositions, punctuation marks, and others.
2.3 PP attachment
Since we do not have phrase labels, we use all
prepositions (except root nodes) as training data.
We use the following features for resolving PP
attachment.

- The preposition itself: w_i.

- Candidate modificand w_j and its POS tag.

- Left words (w_{i−2}, w_{i−1}) and their POS tags.

- Right words (w_{i+1}, w_{i+2}) and their POS tags.

- Previous preposition.

- Ending word of the following base NP and its POS
tag (if any).

- i − j, i.e., the number of words between w_i and w_j.

- Number of commas between w_i and w_j.

- Number of verbs between w_i and w_j.

- Number of prepositions between w_i and w_j.

- Number of base NPs between w_i and w_j.

- Number of conjunctions (CCs) between w_i and w_j.

- Difference of quotation depths between w_i and w_j.
If w_i is not inside a quotation, its quotation depth
is zero; if w_j is inside a quotation, its quotation
depth is one; hence, their difference is one.

- Difference of parenthesis depths between w_i and w_j.
For each preposition, we make a set of triplets
{(y_i, x_{i:1}, x_{i:2})}, where y_i is always +1, x_{i:1}
corresponds to the correct word that is modified by
the preposition, and x_{i:2} corresponds to another
word in the sentence.
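The "number of X between w_i and w_j" features listed above can be read off the POS tag sequence. A minimal sketch; the tag tests and the function name are ours:

```python
def between_counts(tags, i, j):
    """Count features for a preposition at i and candidate modificand at j."""
    lo, hi = sorted((i, j))
    span = tags[lo + 1:hi]                 # tags strictly between the two words
    return {
        "commas": sum(1 for t in span if t == ","),
        "verbs": sum(1 for t in span if t.startswith("VB")),
        "preps": sum(1 for t in span if t in ("IN", "TO")),
        "ccs": sum(1 for t in span if t == "CC"),
        "distance": abs(i - j),
    }

# "He saw a girl with a telescope" -> PRP VBD DT NN IN DT NN
tags = ["PRP", "VBD", "DT", "NN", "IN", "DT", "NN"]
print(between_counts(tags, 4, 1))  # 'with' vs. candidate modificand 'saw'
```

Counts like these give the resolver some of the long-range context that the word-window features alone cannot see.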
3 Results
3.1 Root-Node Finder
For the Root-Node Finder, we used a quadratic
kernel K(x_i, x_j) = (x_i · x_j + 1)² because it was
better than the linear kernel in preliminary
experiments.
When we used the 'correct' POS tags given in the
Penn Treebank and the 'correct' base NP tags given
by a tool provided by the CoNLL 2000 shared task²,
RNF's accuracy was 96.5% for section 23. When we
used Collins' POS tags and base NP tags based on
those POS tags, the accuracy slightly degraded to
95.7%. According to Yamada's paper (Yamada and
Matsumoto, 2003), this root accuracy is better than
those of Charniak's MEIP and Collins' Model 3 parser.

² http://cnts.uia.ac.be/conll200/chunking/
We also conducted an experiment to judge the ef-
fectiveness of the base NP chunker. Here, we used
only the first 10,000 sentences (about 1/4) of the
training data. When we used all features described
above and the POS tags given in Penn Treebank,
the root accuracy was 95.4%. When we removed
the base NP information (bi, Li, Ri), it dropped
to 94.9%. Therefore, the base NP information im-
proves RNF’s performance.
Figure 3 compares SVM and Preference Learn-
ing in terms of the root accuracy. We used the
first 10,000 sentences for training again. Accord-
ing to this graph, Preference Learning is better than
SVM, but the difference is small. (They are bet-
ter than Maximum Entropy Modeling³ that yielded
RA=91.5% for the same data.) C does not affect the
scores very much unless C is too small. In this ex-
periment, we used Penn’s ‘correct’ POS tags. When
we used Collins’ POS tags, the scores dropped by
about one point.
3.2 Dependency Analyzer and PPAR
As for the dependency learning, we used the same
quadratic kernel again because the quadratic kernel
gives the best results according to Yamada’s experi-
ments. The soft margin parameter C is 1 following
Yamada’s experiment. We conducted an experiment
to judge the effectiveness of the Root-Node Finder.
We follow Yamada’s definition of accuracy that ex-
cludes punctuation marks.
Dependency Accuracy (DA) =
#correct parents / #words (= 49,892)
Complete Rate (CR) =
#completely parsed sentences / #sentences
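The two scores can be computed from gold and predicted head indices. A minimal sketch (the exclusion of punctuation marks is elided for brevity):

```python
def scores(gold_heads, pred_heads):
    """gold_heads/pred_heads: one {word_index: head_index} dict per sentence.
    Returns (Dependency Accuracy, Complete Rate)."""
    words = correct = complete = 0
    for gold, pred in zip(gold_heads, pred_heads):
        hits = sum(1 for w, h in gold.items() if pred.get(w) == h)
        words += len(gold)
        correct += hits
        complete += (hits == len(gold))   # sentence parsed completely
    return correct / words, complete / len(gold_heads)

gold = [{0: 1, 2: 1}, {0: 1}]
pred = [{0: 1, 2: 0}, {0: 1}]
da, cr = scores(gold, pred)
print(da, cr)  # 2 of 3 heads correct; 1 of 2 sentences complete
```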
According to Table 1, DA is only slightly improved,
but CR improves more substantially.
³ http://www2.crl.go.jp/jt/a132/members/mutiyama/software.html
Figure 3: Comparison of SVM and Preference Learning
in terms of Root Accuracy (trained with 10,000
sentences). The x-axis is the soft margin parameter C
(0.0001 to 0.1); the y-axis is accuracy (90% to 96%).
              DA      RA      CR
without RNF   89.4%   91.9%   34.7%
with RNF      89.6%   95.7%   35.7%

The Dependency Analyzer was trained with 10,000
sentences. RNF was trained with all of the training
data. DA: Dependency Accuracy, RA: Root Accuracy,
CR: Complete Rate.

Table 1: Effectiveness of the Root-Node Finder
Figure 4: Comparison of SVM and Preference Learning
in terms of Dependency Accuracy of prepositions
(trained with 5,000 sentences). The x-axis is the
soft margin parameter C (0.0001 to 0.1); the y-axis
is accuracy (70% to 82%).
Figure 4 compares SVM and Preference Learning
in terms of the Dependency Accuracy of preposi-
tions. SVM’s performance is unstable for this task,
and Preference Learning outperforms SVM. (We
could not get scores of Maximum Entropy Model-
ing because of memory shortage.)
Table 2 shows the improvement given by PPAR.
Since training of PPAR takes a very long time, we
used only the first 35,000 sentences of the train-
ing data. We also calculated the Dependency Accu-
racy of Collins’ Model 3 parser’s output for section
23. According to this table, PPAR is better than the
Model 3 parser.
Now, we use PPAR’s output for each preposition
instead of the dependency parser’s output unless the
modification makes the dependency tree into a non-
tree graph. Table 3 compares the proposed method
with other methods in terms of accuracy. The data
other than 'Proposed' are cited from Yamada's paper.
                      IN      TO      average
Collins Model 3       84.6%   87.3%   85.1%
Dependency Analyzer   83.4%   86.1%   83.8%
PPAR                  85.3%   87.7%   85.7%

PPAR was trained with 35,000 sentences. The number
of IN words is 5,950 and that of TO words is 1,240.

Table 2: PP-Attachment Resolver
                                       DA      RA      CR
with phrase info.     MEIP             92.1%   95.2%   45.2%
                      Collins Model 3  91.5%   95.2%   43.3%
without phrase info.  Yamada           90.3%   91.6%   38.4%
                      Proposed         91.2%   95.7%   40.7%

Table 3: Comparison with related work
According to this table, the proposed method is
close to the phrase structure parsers except in
Complete Rate. Without PPAR, DA dropped to 90.9%
and CR dropped to 39.7%.
4 Discussion
We used Preference Learning to improve the SVM-
based Dependency Analyzer for root-node finding
and PP-attachment resolution. Preference Learn-
ing gave better scores than Collins’ Model 3 parser
for these subproblems. Therefore, we expect that
our method is also applicable to phrase structure
parsers. It seems that root-node finding is relatively
easy and SVM worked well. However, PP attachment
is more difficult: SVM's behavior was unstable,
whereas Preference Learning was more robust. We want
to fully exploit Preference Learning for dependency
analysis and parsing, but training takes too long.
(Empirically, it takes O(ℓ²) time or more.) Further
study is needed to reduce the computational
complexity. (Since we used Isozaki's methods (Isozaki
and Kazawa, 2002), the run-time complexity is not a
problem.)
Kudo and Matsumoto (2002) proposed an SVM-
based Dependency Analyzer for Japanese sen-
tences. Japanese word dependency is simpler be-
cause no word modifies a left word. Collins and
Duffy (2002) improved Collins’ Model 2 parser
by reranking possible parse trees. Shen and Joshi
(2003) also used the preference kernel K(x_{i:*}, x_{j:*})
for reranking. They compare parse trees, but our
system compares words.
5 Conclusions
Dependency analysis is useful and annotation of
word dependency seems easier than annotation of
phrase labels. However, lack of phrase labels makes
dependency analysis more difficult than phrase
structure parsing. In this paper, we improved a de-
terministic dependency analyzer by adding a Root-
Node Finder and a PP-Attachment Resolver. Pref-
erence Learning gave better scores than Collins’
Model 3 parser for these subproblems, and the per-
formance of the improved system is close to state-
of-the-art phrase structure parsers. It turned out
that SVM was unstable for PP attachment resolution,
whereas Preference Learning was not. We expect that
this method is also applicable to phrase structure
parsers.

References

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the North
American Chapter of the Association for Compu-
tational Linguistics, pages 132-139.

Michael Collins and Nigel Duffy. 2002. New ranking algorithms for parsing and tagging: Kernels
over discrete structures, and the voted perceptron. In Proceedings of the 40th Annual Meeting
of the Association for Computational Linguistics
(ACL), pages 263-270.

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics, pages 16-23.

Michael Collins. 1999. Head-Driven Statistical
Models for Natural Language Parsing. Ph.D.
thesis, Univ. of Pennsylvania.

Nello Cristianini and John Shawe-Taylor. 2000. An
Introduction to Support Vector Machines. Cam-
bridge University Press.

Jason M. Eisner. 1996. Three new probabilistic
models for dependency parsing: An exploration.
In Proceedings of the International Conference
on Computational Linguistics, pages 340-345.

Ralf Herbrich, Thore Graepel, Peter Bollmann-
Sdorra, and Klaus Obermayer. 1998. Learning
preference relations for information retrieval. In
Proceedings of ICML-98 Workshop on Text Cate-
gorization and Machine Learning, pages 80-84.

Ralf Herbrich, Thore Graepel, and Klaus Ober-
mayer, 2000. Large Margin Rank Boundaries for
Ordinal Regression, chapter 7, pages 115-132.
MIT Press.

Julia Hockenmaier and Mark Steedman. 2002.
Generative models for statistical parsing with
combinatory categorial grammar. In Proceedings
of the 40th Annual Meeting of the Association for
Computational Linguistics, pages 335-342.

Hideki Isozaki and Hideto Kazawa. 2002. Efficient
support vector classifiers for named entity recognition. In Proceedings of COLING-2002, pages
390-396.

Thorsten Joachims. 1999. Making large-scale
support vector machine learning practical. In
B. Scholkopf, C. J. C. Burges, and A. J. Smola,
editors, Advances in Kernel Methods, chapter 16,
pages 170-184. MIT Press.

Thorsten Joachims. 2002. Optimizing search en-
gines using clickthrough data. In Proceedings of
the ACM Conference on Knowledge Discovery
and Data Mining.

Taku Kudo and Yuji Matsumoto. 2001. Chunking
with support vector machines. In Proceedings of
NAACL-2001, pages 192-199.

Taku Kudo and Yuji Matsumoto. 2002. Japanese
dependency analysis using cascaded chunking.
In Proceedings of CoNLL, pages 63-69.

Mitchell P. Marcus, Beatrice Santorini, and Mary A.
Marcinkiewicz. 1993. Building a large annotated
corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

Adwait Ratnaparkhi. 1996. A maximum entropy
part-of-speech tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Libin Shen and Aravind K. Joshi. 2003. An SVM
based voting algorithm with application to parse
reranking. In Proceedings of the Seventh Confer-
ence on Natural Language Learning, pages 9-16.

Daniel Sleator and Davy Temperley. 1991. Parsing
English with a Link grammar. Technical Report
CMU-CS-91-196, Carnegie Mellon University.

Kiyoshi Sudo, Satoshi Sekine, and Ralph Grishman.
2003. An improved extraction pattern representation model for automatic IE pattern acquisition. In Proceedings of the Annual Meeting of the
Association for Computational Linguistics, pages
224-231.

Jun Suzuki, Tsutomu Hirao, Yutaka Sasaki, and
Eisaku Maeda. 2003. Hierarchical directed acyclic
graph kernel: Methods for structured natural language data. In Proceedings of ACL-2003, pages
32-39.

Vladimir N. Vapnik. 1995. The Nature of Statisti-
cal Learning Theory. Springer.

Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of
the International Workshop on Parsing Technologies, pages 195-206.
