Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 320–327,
New York, June 2006. ©2006 Association for Computational Linguistics
Prototype-Driven Learning for Sequence Models

Aria Haghighi
Computer Science Division
University of California, Berkeley
aria42@cs.berkeley.edu

Dan Klein
Computer Science Division
University of California, Berkeley
klein@cs.berkeley.edu
Abstract

We investigate prototype-driven learning for primarily unsupervised sequence modeling. Prior knowledge is specified declaratively, by providing a few canonical examples of each target annotation label. This sparse prototype information is then propagated across a corpus using distributional similarity features in a log-linear generative model. On part-of-speech induction in English and Chinese, as well as an information extraction task, prototype features provide substantial error rate reductions over competitive baselines and outperform previous work. For example, we can achieve an English part-of-speech tagging accuracy of 80.5% using only three examples of each tag and no dictionary constraints. We also compare to semi-supervised learning and discuss the system's error trends.
1 Introduction

Learning, broadly taken, involves choosing a good model from a large space of possible models. In supervised learning, model behavior is primarily determined by labeled examples, whose production requires a certain kind of expertise and, typically, a substantial commitment of resources. In unsupervised learning, model behavior is largely determined by the structure of the model. Designing models to exhibit a certain target behavior requires another, rare kind of expertise and effort. Unsupervised learning, while minimizing the usage of labeled data, does not necessarily minimize total effort. We therefore consider here how to learn models with the least effort. In particular, we argue for a certain kind of semi-supervised learning, which we call prototype-driven learning.

In prototype-driven learning, we specify prototypical examples for each target label or label configuration, but do not necessarily label any documents or sentences. For example, when learning a model for Penn treebank-style part-of-speech tagging in English, we may list the 45 target tags and a few examples of each tag (see figure 4 for a concrete prototype list for this task). This manner of specifying prior knowledge about the task has several advantages. First, it is certainly compact (though it remains to be proven that it is effective). Second, it is more or less the minimum one would have to provide to a human annotator in order to specify a new annotation task and policy (compare, for example, with the list in figure 2, which suggests an entirely different task). Indeed, prototype lists have been used pedagogically to summarize tagsets to students (Manning and Schütze, 1999). Finally, natural language does exhibit pro-form and prototype effects (Radford, 1988), which suggests that learning by analogy to prototypes may be effective for language tasks.
In this paper, we consider three sequence modeling tasks: part-of-speech tagging in English and Chinese, and a classified-ads information extraction task. Our general approach is to use distributional similarity to link any given word to similar prototypes. For example, the word reported may be linked to said, which is in turn a prototype for the part-of-speech VBD. We then encode these prototype links as features in a log-linear generative model, which is trained to fit unlabeled data (see section 4.1). Distributional prototype features provide substantial error rate reductions on all three tasks. For example, on English part-of-speech tagging with three prototypes per tag, adding prototype features to the baseline raises per-position accuracy from 41.3% to 80.5%.
2 Tasks and Related Work: Tagging

For our part-of-speech tagging experiments, we used data from the English and Chinese Penn treebanks (Marcus et al., 1994; Ircs, 2002). Example sentences
(a) DT VBN NNS RB MD VB NNS TO VB NNS IN NNS RBR CC RBR RB .
    The proposed changes also would allow executives to report exercises of options later and less often .
(b) NR AD VV AS PU NN VV DER VV PU PN AD VV DER VV PU DEC NN VV PU
    [Chinese characters not recoverable from the extraction]
(c) FEAT FEAT FEAT FEAT NBRHD NBRHD NBRHD NBRHD NBRHD SIZE SIZE SIZE SIZE
    Vine covered cottage , near Contra Costa Hills . 2 bedroom house ,
    FEAT FEAT FEAT FEAT FEAT RESTR RESTR RESTR RESTR RENT RENT RENT RENT
    modern kitchen and dishwasher . No pets allowed . 1050 / month$

Figure 1: Sequence tasks: (a) English POS, (b) Chinese POS, and (c) Classified ad segmentation
are shown in figure 1(a) and (b). A great deal of research has investigated the unsupervised and semi-supervised induction of part-of-speech models, especially in English, and there is unfortunately only space to mention some highly related work here.

One approach to unsupervised learning of part-of-speech models is to induce HMMs from unlabeled data in a maximum-likelihood framework. For example, Merialdo (1991) presents experiments learning HMMs using EM. Merialdo's results most famously show that re-estimation degrades accuracy unless almost no examples are labeled. Less famously, his results also demonstrate that re-estimation can improve tagging accuracies to some degree in the fully unsupervised case.

One recent and much more successful approach to part-of-speech learning is contrastive estimation, presented in Smith and Eisner (2005). They utilize task-specific comparison neighborhoods for part-of-speech tagging to alter their objective function.

Both of these works require specification of the legal tags for each word. Such dictionaries are large and embody a great deal of lexical knowledge. A prototype list, in contrast, is extremely compact.
3 Tasks and Related Work: Extraction

Grenager et al. (2005) presents an unsupervised approach to an information extraction task, called CLASSIFIEDS here, which involves segmenting classified advertisements into topical sections (see figure 1(c)). Labels in this domain tend to be "sticky" in that the correct annotation tends to consist of multi-element fields of the same label. The overall approach of Grenager et al. (2005) typifies the process involved in fully unsupervised learning on a new domain: they first alter the structure of their HMM so that diagonal transitions are preferred, then modify the transition structure to explicitly model boundary tokens, and so on. Given enough refinements the model learns to segment with a reasonable match to the target structure.

Label          Prototypes
ROOMATES       roommate respectful drama
RESTRICTIONS   pets smoking dog
UTILITIES      utilities pays electricity
AVAILABLE      immediately begin cheaper
SIZE           2 br sq
PHOTOS         pictures image link
RENT           $ month *number*15*1
CONTACT        *phone* call *time*
FEATURES       kitchen laundry parking
NEIGHBORHOOD   close near shopping
ADDRESS        address carlmont *ordinal*5
BOUNDARY       ; . !

Figure 2: Prototype list derived from the development set of the CLASSIFIEDS data. The BOUNDARY field is not present in the original annotation, but added to model boundaries (see Section 5.3). The starred tokens are the results of collapsing of basic entities during pre-processing, as is done in Grenager et al. (2005).
In section 5.3, we discuss an approach to this task which does not require customization of model structure, but rather centers on feature engineering.
4 Approach

In the present work, we consider the problem of learning sequence models over text. For each document x = [x_i], we would like to predict a sequence of labels y = [y_i], where x_i ∈ X and y_i ∈ Y. We construct a generative model, p(x, y|θ), where θ are the model's parameters, and choose parameters to maximize the log-likelihood of our observed data D:

  L(\theta; D) = \sum_{x \in D} \log p(x|\theta) = \sum_{x \in D} \log \sum_{y} p(x, y|\theta)
Figure 3: Graphical model representation of the trigram tagger for the English POS domain. Example nodes: y_{i-1} = ⟨DT, NN⟩, y_i = ⟨NN, VBD⟩, x_{i-1} = witness, x_i = reported. Active features:

  f(x_i, y_i) = { word = reported, suffix-2 = ed, proto = said, proto = had } ∧ VBD
  f(y_{i-1}, y_i) = DT ∧ NN ∧ VBD
4.1 Markov Random Fields

We take our model family to be chain-structured Markov random fields (MRFs), the undirected equivalent of HMMs. Our joint probability model over (x, y) is given by

  p(x, y|\theta) = \frac{1}{Z(\theta)} \prod_{i=1}^{n} \phi(x_i, y_i)\,\phi(y_{i-1}, y_i)

where φ(c) is a potential over a clique c, taking the form exp{θ^T f(c)}, and f(c) is the vector of features active over c. In our sequence models, the cliques are over the edges/transitions (y_{i-1}, y_i) and nodes/emissions (x_i, y_i). See figure 3 for an example from the English POS tagging domain.

Note that the only way an MRF differs from a conditional random field (CRF) (Lafferty et al., 2001) is that the partition function is no longer observation-dependent; we are modeling the joint probability of x and y instead of y given x. As a result, learning an MRF is slightly harder than learning a CRF; we discuss this issue in section 4.4.
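To make the joint formulation concrete, here is a minimal brute-force sketch of such a model, with a toy vocabulary and label set of our own invention (the feature names and helper functions are illustrative, not the paper's code). The key point it demonstrates is that the partition function sums over observation sequences as well as label sequences, unlike a CRF.

```python
import math
from itertools import product

def score(xs, ys, theta):
    """exp{theta . f(x, y)} for a chain MRF: one emission clique per
    position and one transition clique per adjacent label pair."""
    s = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        s += theta.get(("word", x, y), 0.0)               # phi(x_i, y_i)
        if i > 0:
            s += theta.get(("trans", ys[i - 1], y), 0.0)  # phi(y_{i-1}, y_i)
    return math.exp(s)

def joint_prob(xs, ys, vocab, labels, theta):
    """p(x, y | theta); Z sums over all (x, y) pairs of this length
    (brute force, so only feasible on toy closed vocabularies)."""
    n = len(xs)
    Z = sum(score(list(xc), list(yc), theta)
            for xc in product(vocab, repeat=n)
            for yc in product(labels, repeat=n))
    return score(xs, ys, theta) / Z
```

With a single weight favoring the (said, VBD) emission, the joint probabilities over all length-1 (x, y) pairs still sum to one, which is exactly the MRF normalization described above.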
4.2 Prototype-Driven Learning

We assume prior knowledge about the target structure via a prototype list, which specifies the set of target labels Y and, for each label y ∈ Y, a set of prototype words, p_y ∈ P_y. See figures 2 and 4 for examples of prototype lists.1

1 Note that this setting differs from the standard semi-supervised learning setup, where a small number of fully labeled examples are given and used in conjunction with a larger amount of unlabeled data. In our prototype-driven approach, we never provide a single fully labeled example sequence. See section 5.3 for further comparison of this setting to semi-supervised learning.
Broadly, we would like to learn sequence models which both explain the observed data and meet our prior expectations about target structure. A straightforward way to implement this is to constrain each prototype word to take only its given label(s) at training time. As we show in section 5, this does not work well in practice because this constraint on the model is very sparse.

In providing a prototype, however, we generally mean something stronger than a constraint on that word. In particular, we may intend that words which are in some sense similar to a prototype generally be given the same label(s) as that prototype.
4.3 Distributional Similarity

In syntactic distributional clustering, words are grouped on the basis of the vectors of their preceding and following words (Schütze, 1995; Clark, 2001). The underlying linguistic idea is that replacing a word with another word of the same syntactic category should preserve syntactic well-formedness (Radford, 1988). We present more details in section 5, but for now assume that a similarity function over word types is given.

Suppose further that for each non-prototype word type w, we have a subset of prototypes, S_w, which are known to be distributionally similar to w (above some threshold). We would like our model to relate the tags of w to those of S_w.

One approach to enforcing the distributional assumption in a sequence model is by supplementing the training objective (here, data likelihood) with a penalty term that encourages parameters for which each w's posterior distribution over tags is compatible with its prototypes S_w. For example, we might maximize

  \sum_{x \in D} \log p(x|\theta) - \sum_{w} \sum_{z \in S_w} \mathrm{KL}\big(t|z \,\|\, t|w\big)

where t|w is the model's distribution of tags for word w. The disadvantage of a penalty-based approach is that it is difficult to construct the penalty term in a way which produces exactly the desired behavior.
Instead, we introduce distributional prototypes into the learning process as features in our log-linear model. Concretely, for each prototype z, we introduce a predicate PROTO = z which becomes active at each w for which z ∈ S_w (see figure 3). One advantage of this approach is that it allows the strength of the distributional constraint to be calibrated along with any other features; it was also more successful in our experiments.
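Activating these predicates at a token is just a feature-extraction step. A sketch (our own code; `proto_sets` is a hypothetical name standing in for the similarity-thresholded sets S_w assumed above):

```python
def token_features(word, proto_sets):
    """Active node predicates at a token of `word`: a couple of spelling
    features plus one PROTO=z predicate for each distributionally similar
    prototype z in S_w. `proto_sets` maps word type -> set of prototypes."""
    feats = [f"word={word}", f"suffix-2={word[-2:]}"]
    for z in sorted(proto_sets.get(word, ())):
        feats.append(f"proto={z}")
    return feats
```

For the running example, a token of reported linked to said and had would fire word=reported, suffix-2=ed, proto=had, and proto=said, each conjoined with the candidate tag as in figure 3.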
4.4 Parameter Estimation

So far we have ignored the issue of how we learn model parameters θ which maximize L(θ; D). If our model family were HMMs, we could use the EM algorithm to perform a local search. Since we have a log-linear formulation, we instead use a gradient-based search. In particular, we use L-BFGS (Liu and Nocedal, 1989), a standard numerical optimization technique, which requires the ability to evaluate L(θ; D) and its gradient at a given θ.

The density p(x|θ) is easily calculated up to the global constant Z(θ) using the forward-backward algorithm (Rabiner, 1989). The partition function is given by

  Z(\theta) = \sum_{x} \sum_{y} \prod_{i=1}^{n} \phi(x_i, y_i)\,\phi(y_{i-1}, y_i) = \sum_{x} \sum_{y} \mathrm{score}(x, y)

Z(θ) can be computed exactly under certain assumptions about the clique potentials, but can in all cases be bounded by

  \hat{Z}(\theta) = \sum_{\ell=1}^{K} \hat{Z}_\ell(\theta) = \sum_{\ell=1}^{K} \sum_{x : |x| = \ell} \sum_{y} \mathrm{score}(x, y)

where K is a suitably chosen large constant. We can efficiently compute \hat{Z}_\ell(\theta) for fixed \ell using a generalization of the forward-backward algorithm to the lattice of all observations x of length \ell (see Smith and Eisner (2005) for an exposition).
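A bigram sketch of that per-length computation (the paper's tagger is a trigram model; this simplification is ours): because the emission potentials factor per position, summing score(x, y) over every observation sequence of length ℓ reduces to replacing each emission term with ψ(y) = Σ_{w∈V} φ(w, y) inside an ordinary forward recursion over labels.

```python
def Z_ell(ell, vocab, labels, phi_node, phi_edge):
    """Sum of score(x, y) over all observation sequences x of length ell
    and all label sequences y, via a forward recursion on the lattice of
    all observations. Emissions are pre-summed over the vocabulary."""
    # psi(y) = sum over the vocabulary of the emission potential phi(w, y)
    psi = {y: sum(phi_node(w, y) for w in vocab) for y in labels}
    alpha = dict(psi)  # position 1: no incoming transition
    for _ in range(ell - 1):
        alpha = {y: psi[y] * sum(alpha[yp] * phi_edge(yp, y) for yp in labels)
                 for y in labels}
    return sum(alpha.values())
```

As a sanity check, with all potentials equal to 1 the result must count the (|V| · |Y|)^ℓ sequence pairs, each scoring 1.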
Similar to supervised maximum entropy problems, the partial derivative of L(θ; D) with respect to each parameter θ_j (associated with feature f_j) is given by a difference in feature expectations:

  \frac{\partial L(\theta; D)}{\partial \theta_j} = \sum_{x \in D} \big( E_{y|x,\theta}[f_j] - E_{x,y|\theta}[f_j] \big)

The first expectation is the expected count of the feature under the model's p(y|x, θ) and is again easily computed with the forward-backward algorithm, just as for CRFs or HMMs. The second expectation is the expectation of the feature under the model's joint distribution over all x, y pairs, and is harder to calculate. Again assuming that sentences beyond a certain length have negligible mass, we calculate the expectation of the feature for each fixed length ℓ and take a (truncated) weighted sum:

  E_{x,y|\theta}[f_j] = \sum_{\ell=1}^{K} p(|x| = \ell)\, E_{x,y|\ell,\theta}[f_j]

For fixed ℓ, we can calculate E_{x,y|\ell,\theta}[f_j] using the lattice of all inputs of length ℓ. The quantity p(|x| = ℓ) is simply \hat{Z}_\ell(\theta) / \hat{Z}(\theta).

As regularization, we use a diagonal Gaussian prior with variance σ² = 0.5, which gave relatively good performance on all tasks.

                Num Tokens
Setting        48K    193K
BASE           42.2   41.3
PROTO          61.9   68.8
PROTO+SIM      79.1   80.5

Table 1: English POS results measured by per-position accuracy
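On a model small enough to enumerate, the gradient formula above can be checked directly. This brute-force sketch (our own, with hypothetical feature names) makes the MRF-specific second term explicit: the subtracted expectation runs over all (x, y) pairs, not just over y given each observed x.

```python
import math
from itertools import product

def gradient(data, vocab, labels, feats, theta):
    """d L / d theta_j = sum_x ( E_{y|x}[f_j] - E_{x,y}[f_j] ), by brute
    force on a toy closed vocabulary; `feats(xs, ys)` lists active features."""
    def score(xs, ys):
        return math.exp(sum(theta.get(f, 0.0) for f in feats(xs, ys)))
    n = len(data[0])
    pairs = [(list(x), list(y))
             for x in product(vocab, repeat=n)
             for y in product(labels, repeat=n)]
    Z = sum(score(x, y) for x, y in pairs)
    # E_{x,y|theta}[f_j]: expectation under the joint over all (x, y) pairs
    e_model = {}
    for x, y in pairs:
        p = score(x, y) / Z
        for f in feats(x, y):
            e_model[f] = e_model.get(f, 0.0) + p
    grad = {}
    for xs in data:
        # E_{y|x,theta}[f_j]: conditional expectation for the observed x
        Zx = sum(score(xs, list(y)) for y in product(labels, repeat=n))
        for y in product(labels, repeat=n):
            p = score(xs, list(y)) / Zx
            for f in feats(xs, list(y)):
                grad[f] = grad.get(f, 0.0) + p
        for f, e in e_model.items():
            grad[f] = grad.get(f, 0.0) - e
    return grad
```

At θ = 0 every sequence is equiprobable, so for an observed word the conditional expectation of its emission features exceeds the joint expectation, pushing those weights up, exactly the direction a likelihood climber should move.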
5 Experiments

We experimented with prototype-driven learning in three domains: English and Chinese part-of-speech tagging and classified advertisement field segmentation. At inference time, we used maximum posterior decoding,2 which we found to be uniformly but slightly superior to Viterbi decoding.
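Given per-position posterior marginals from forward-backward, maximum posterior decoding is a one-line sketch:

```python
def posterior_decode(marginals):
    """At each position choose the label with the highest posterior
    marginal p(y_i | x); `marginals` is a list of dicts label -> prob,
    e.g. as produced by forward-backward. Unlike Viterbi, the returned
    sequence need not be the single highest-scoring joint labeling."""
    return [max(m, key=m.get) for m in marginals]
```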
5.1 English POS Tagging

For our English part-of-speech tagging experiments, we used the WSJ portion of the English Penn treebank (Marcus et al., 1994). We took our data to be either the first 48K tokens (2000 sentences) or 193K tokens (8000 sentences) starting from section 2. We used a trigram tagger of the model form outlined in section 4.1 with the same set of spelling features reported in Smith and Eisner (2005): exact word type,

2 At each position choosing the label which has the highest posterior probability, obtained from the forward-backward algorithm.
Label  Prototypes               Label  Prototypes
NN     % company year           NNS    years shares companies
JJ     new other last           VBG    including being according
MD     will would could         -LRB-  -LRB- -LCB-
VBP    are 're 've              DT     the a The
RB     n't also not             WP$    whose
-RRB-  -RRB- -RCB-              FW     bono del kanji
WRB    when how where           RP     Up ON
IN     of in for                VBD    said was had
SYM    c b f                    $      $ US$ C$
CD     million billion two      #      #
TO     to To na                 :      -- : ;
VBN    been based compared      NNPS   Philippines Angels Rights
RBR    Earlier duller           ``     `` ` non-``
VBZ    is has says              VB     be take provide
JJS    least largest biggest    RBS    Worst
NNP    Mr. U.S. Corp.           ,      ,
POS    'S                       CC     and or But
PRP$   its their his            JJR    smaller greater larger
PDT    Quite                    WP     who what What
WDT    which Whatever whatever  .      . ? !
EX     There                    PRP    it he they
''     ''                       UH     Oh Well Yeah

Figure 4: English POS prototype list
Correct Tag   Predicted Tag   % of Errors
CD            DT              6.2
NN            JJ              5.3
JJ            NN              5.2
VBD           VBN             3.3
NNS           NN              3.2

Figure 5: Most common English POS confusions for PROTO+SIM on 193K tokens
character suffixes of length up to 3, initial-capital, contains-hyphen, and contains-digit. Our only edge features were tag trigrams.

With just these features (our baseline BASE) the problem is symmetric in the 45 model labels. In order to break initial symmetry we initialized our potentials to be near one, with some random noise. To evaluate in this setting, model labels must be mapped to target labels. We followed the common approach in the literature, greedily mapping each model label to a target label in order to maximize per-position accuracy on the dataset. The results of BASE, reported in table 1, depend upon random initialization; averaging over 10 runs gave an average per-position accuracy of 41.3% on the larger training set.
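The greedy evaluation mapping can be sketched as follows (our own code, not the authors' exact procedure): each model label is mapped to the gold label it most frequently coincides with, and accuracy is scored under that many-to-one mapping.

```python
def greedy_map(counts):
    """Map each model label to the gold label it co-occurs with most
    often, then score per-position accuracy under that mapping.
    `counts[(model_label, gold_label)]` is a co-occurrence count."""
    mapping = {}
    for m in {mm for mm, _ in counts}:
        golds = [g for mm, g in counts if mm == m]
        mapping[m] = max(golds, key=lambda g: counts[(m, g)])
    total = sum(counts.values())
    correct = sum(c for (m, g), c in counts.items() if mapping[m] == g)
    return mapping, correct / total
```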
We automatically extracted the prototype list by taking our data and selecting for each annotated label the top three occurring word types which were not given another label more often. This resulted in 116 prototypes for the 193K token setting.3 For comparison, there are 18,423 word types occurring in this data.
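That extraction rule can be sketched directly (our illustration of the stated criterion, not the authors' code): a word type qualifies as a prototype for a label only if that label is its majority label, and the k most frequent such types are kept per label.

```python
from collections import Counter

def extract_prototypes(tagged_tokens, k=3):
    """For each label, pick the k most frequent word types whose most
    frequent label is that label. `tagged_tokens` is a list of
    (word, label) pairs from an annotated corpus."""
    pair_counts = Counter(tagged_tokens)
    word_counts = Counter(w for w, _ in tagged_tokens)
    majority = {}
    for w in word_counts:
        labels = [l for (ww, l) in pair_counts if ww == w]
        majority[w] = max(labels, key=lambda l: pair_counts[(w, l)])
    protos = {}
    for label in {l for _, l in tagged_tokens}:
        cands = sorted((w for w in word_counts if majority[w] == label),
                       key=lambda w: -word_counts[w])
        protos[label] = cands[:k]
    return protos
```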
Incorporating the prototype list in the simplest possible way, we fixed prototype occurrences in the data to their respective annotation labels. In this case, the model is no longer symmetric, and we no longer require random initialization or post-hoc mapping of labels. Adding prototypes in this way gave an accuracy of 68.8% on all tokens, but only 47.7% on non-prototype occurrences, which is only a marginal improvement over BASE. It appears as though the prototype information is not spreading to non-prototype words.
In order to remedy this, we incorporated distributional similarity features. Similar to Schütze (1995), we collect for each word type a context vector of the counts of the most frequent 500 words, conjoined with a direction and distance (e.g. +1, -2). We then performed an SVD on the matrix to obtain a reduced rank approximation. We used the dot product between left singular vectors as a measure of distributional similarity. For each word w, we find the set of prototype words with similarity exceeding a fixed threshold of 0.35. For each of these prototypes z, we add a predicate PROTO = z to each occurrence of w. For example, we might add PROTO = said to each token of reported (as in figure 3).4
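This recipe can be sketched with a rank-reduced SVD on a toy count matrix. The code is our illustration, not the paper's, and it hedges one detail: we compare words via cosine of their reduced-rank rows rather than raw dot products of singular vectors, which keeps the fixed threshold meaningful on the toy data.

```python
import numpy as np

def proto_links(counts, words, prototypes, rank=250, threshold=0.35):
    """Link each word type to prototypes whose distributional similarity
    exceeds `threshold`. `counts` is a word-by-context count matrix;
    we SVD it, keep up to `rank` dimensions, and compare the normalized
    reduced-rank word vectors."""
    U, S, _ = np.linalg.svd(np.asarray(counts, dtype=float),
                            full_matrices=False)
    R = U[:, :rank] * S[:rank]                  # reduced representations
    R /= np.maximum(np.linalg.norm(R, axis=1, keepdims=True), 1e-12)
    idx = {w: i for i, w in enumerate(words)}
    return {w: {z for z in prototypes
                if z != w and float(R[idx[w]] @ R[idx[z]]) > threshold}
            for w in words}
```

On a matrix where reported and said share identical context rows, reported is linked to the prototype said while a distributionally unrelated word is linked to nothing.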
Each prototype word is also its own prototype (since a word has maximum similarity to itself), so when we lock the prototype to a label, we are also pushing all the words distributionally similar to that prototype towards that label.5

3 To be clear: this method of constructing a prototype list required statistics from the labeled data. However, we believe it to be a fair and necessary approach for several reasons. First, we wanted our results to be repeatable. Second, we did not want to overly tune this list, though experiments below suggest that tuning could greatly reduce the error rate. Finally, it allowed us to run on Chinese, where the authors have no expertise.

4 Details of distributional similarity features: To extract context vectors, we used a window of size 2 in either direction and use the first 250 singular vectors. We collected counts from all the WSJ portion of the Penn Treebank as well as the entire BLLIP corpus. We limited each word to have similarity features for its top 5 most similar prototypes.

5 Note that the presence of a prototype feature does not ensure every instance of that word type will be given its prototype's label; pressure from "edge" features or other prototype features can cause occurrences of a word type to be given different labels. However, rare words with a single prototype feature are almost always given that prototype's label.
This setting, PROTO+SIM, brings the all-tokens accuracy up to 80.5%, which is a 37.5% error reduction over PROTO. For non-prototypes, the accuracy increases to 67.8%, an error reduction of 38.4% over PROTO. The overall error reduction from BASE to PROTO+SIM on all-token accuracy is 66.7%.

Figure 5 lists the most common confusions for PROTO+SIM. The second, third, and fourth most common confusions are characteristic of fully supervised taggers (though greater in number here) and are difficult. For instance, both JJs and NNs tend to occur after determiners and before nouns. The CD and DT confusion is a result of our prototype list not containing a contains-digit prototype for CD, so the predicate fails to be linked to CDs. Of course, in a realistic, iterative design setting, we could have altered the prototype list to include a contains-digit prototype for CD and corrected this confusion.

Figure 6 shows the marginal posterior distribution over label pairs (roughly, the bigram transition matrix) according to the treebank labels and the PROTO+SIM model run over the training set (using a collapsed tag set for space). Note that the broad structure is recovered to a reasonable degree.
It is difficult to compare our results to other systems which utilize a full or partial tagging dictionary, since the amount of provided knowledge is substantially different. The best comparison is to Smith and Eisner (2005), who use a partial tagging dictionary. In order to compare with their results, we projected the tagset to the coarser set of 17 that they used in their experiments. On 24K tokens, our PROTO+SIM model scored 82.2%. When Smith and Eisner (2005) limit their tagging dictionary to words which occur at least twice, their best performing neighborhood model achieves 79.5%. While these numbers seem close, for comparison, their tagging dictionary contained information about the allowable tags for 2,125 word types (out of 5,406 types) and their system must only choose, on average, between 4.4 tags for a word. Our prototype list, however, contains information about only 116 word types and our tagger must on average choose between 16.9 tags, a much harder task. When Smith and Eisner (2005) include tagging dictionary entries for all words in the first half of their 24K tokens, giving tagging knowledge for 3,362 word types, they do achieve a higher accuracy of 88.1%.
Setting              Accuracy
BASE                 46.4
PROTO                53.7
PROTO+SIM            71.5
PROTO+SIM+BOUND      74.1

Figure 7: Results on the test set for the ads data of Grenager et al. (2005)
5.2 Chinese POS Tagging

We also tested our POS induction system on the Chinese POS data in the Chinese Treebank (Ircs, 2002). The model is wholly unmodified from the English version except that the suffix features are removed since, in Chinese, suffixes are not a reliable indicator of part-of-speech as in English (Tseng et al., 2005). Since we did not have access to a large auxiliary unlabeled corpus that was segmented, our distributional model was built only from the treebank text, and the distributional similarities are presumably degraded relative to the English. On 60K word tokens, BASE gave an accuracy of 34.4, PROTO gave 39.0, and PROTO+SIM gave 57.4, similar in order if not magnitude to the English case.

We believe the performance for Chinese POS tagging is not as high as English for two reasons: the general difficulty of Chinese POS tagging (Tseng et al., 2005) and the lack of a larger segmented corpus from which to build distributional models. Nonetheless, the addition of distributional similarity features does reduce the error rate by 35% from BASE.
5.3 Information Field Segmentation

We tested our framework on the CLASSIFIEDS data described in Grenager et al. (2005) under conditions similar to POS tagging. An important characteristic of this domain (see figure 1(c)) is that the hidden labels tend to be "sticky," in that fields tend to consist of runs of the same label, in contrast with part-of-speech tagging, where we rarely see adjacent tokens given the same label. Grenager et al. (2005) report that in order to learn this "sticky" structure, they had to alter the structure of their HMM so that a fixed mass is placed on each diagonal transition. In this work, we learned this structure automatically through prototype similarity features without manually constraining the model (see
Figure 6: English coarse POS tag structure: (a) corresponds to "correct" transition structure from labeled data, (b) corresponds to PROTO+SIM on 24K tokens. [Matrix plots over the collapsed tag set INPUNC, PRT, TO, VBN, LPUNC, W, DET, ADV, V, POS, ENDPUNC, VBG, PREP, ADJ, RPUNC, N, CONJ; the images are not recoverable from the extraction.]
Figure 8: Field segmentation observed transition structure: (a) labeled data, (b) BASE, (c) BASE+PROTO+SIM+BOUND (after post-processing). [Matrix plots over the labels ROOMATES, UTILITIES, RESTRICTIONS, AVAILABLE, SIZE, PHOTOS, RENT, FEATURES, CONTACT, NEIGHBORHOOD, ADDRESS; the images are not recoverable from the extraction.]
figure 8), though we did change the similarity function (see below).
On the test set of Grenager et al. (2005), BASE scored an accuracy of 46.4%, comparable to Grenager et al. (2005)'s unsupervised HMM baseline. Adding the prototype list (see figure 2) without distributional features yielded a slightly improved accuracy of 53.7%. For this domain, we utilized a slightly different notion of distributional similarity: we are not interested in the syntactic behavior of a word type, but its topical content. Therefore, when we collect context vectors for word types in this domain, we make no distinction by direction or distance and collect counts from a wider window. This notion of distributional similarity is more similar to latent semantic indexing (Deerwester et al., 1990). A natural consequence of this definition of distributional similarity is that many neighboring words will share the same prototypes. Therefore distributional prototype features will encourage labels to persist, naturally giving the "sticky" effect of the domain. Adding distributional similarity features to our model (PROTO+SIM) improves accuracy substantially, yielding 71.5%, a 38.4% error reduction over BASE.6
Another feature of this domain that Grenager et al. (2005) take advantage of is that end-of-sentence punctuation tends to indicate the end of a field and the beginning of a new one. Grenager et al. (2005) experiment with manually adding boundary states and biasing transitions from these states to not self-loop. We capture this "boundary" effect by simply adding a line to our prototype list, adding a new BOUNDARY state (see figure 2) with a few (hand-chosen) prototypes. Since we utilize a trigram tagger, we are able to naturally capture the effect that BOUNDARY tokens typically indicate transitions between the fields before and after the boundary token. As a post-processing step, when a token is tagged as a BOUNDARY

6 Distributional similarity details: We collect for each word a context vector consisting of the counts for words occurring within three tokens of an occurrence of that word. We perform an SVD and use the first 50 singular vectors.
Correct Tag   Predicted Tag   % of Errors
FEATURES      SIZE            11.2
FEATURES      NBRHD           9.0
SIZE          FEATURES        7.7
NBRHD         FEATURES        6.4
ADDRESS       NBRHD           5.3
UTILITIES     FEATURES        5.3

Figure 9: Most common classified ads confusions
token, it is given the same label as the previous non-BOUNDARY token, which reflects the annotational convention that boundary tokens are given the same label as the field they terminate. Adding the BOUNDARY label yields significant improvements, as indicated by the PROTO+SIM+BOUND setting in figure 7, surpassing the best unsupervised result of Grenager et al. (2005), which is 72.4%. Furthermore, our PROTO+SIM+BOUND model comes close to the supervised HMM accuracy of 74.4% reported in Grenager et al. (2005).
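The post-processing rule above can be sketched directly (our code, with a hypothetical helper name):

```python
def relabel_boundaries(labels, boundary="BOUNDARY"):
    """Give each token tagged BOUNDARY the label of the nearest preceding
    non-BOUNDARY token, matching the annotation convention that boundary
    tokens carry the label of the field they terminate."""
    out, prev = [], None
    for lab in labels:
        if lab == boundary and prev is not None:
            out.append(prev)
        else:
            out.append(lab)
            if lab != boundary:
                prev = lab
    return out
```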
We also compared our method to the most basic semi-supervised setting, where fully labeled documents are provided along with unlabeled ones. Roughly 25% of the data had to be labeled in order to achieve an accuracy equal to our PROTO+SIM+BOUND model, suggesting that the use of prior knowledge in the prototype system is particularly efficient.
In figure 9, we provide the top confusions made by our PROTO+SIM+BOUND model. As can be seen, many of our confusions involve the FEATURES field, which serves as a general purpose background state, which often differs subtly from other fields such as SIZE. For instance, the parenthetical comment ( master has walk - in closet with vanity ) is labeled as a SIZE field in the data, but our model proposed it as a FEATURES field. NEIGHBORHOOD and ADDRESS is another natural confusion resulting from the fact that the two fields share much of the same vocabulary (e.g. [ADDRESS 2525 Telegraph Ave.] vs. [NBRHD near Telegraph]).
Acknowledgments  We would like to thank the anonymous reviewers for their comments. This work is supported by a Microsoft/CITRIS grant and by an equipment donation from Intel.
6 Conclusions

We have shown that distributional prototype features can allow one to specify a target labeling scheme in a compact and declarative way. These features give substantial error reduction on several induction tasks by allowing one to link words to prototypes according to distributional similarity. Another positive property of this approach is that it tries to reconcile the success of sequence-free distributional methods in unsupervised word clustering with the success of sequence models in supervised settings: the similarity guides the learning of the sequence model.
References

Alexander Clark. 2001. The unsupervised induction of stochastic context-free grammars using distributional clustering. In CoNLL.

Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407.

Trond Grenager, Dan Klein, and Christopher Manning. 2005. Unsupervised learning of field segmentation models for information extraction. In Proceedings of the 43rd Meeting of the ACL.

Nianwen Xue Ircs. 2002. Building a large-scale annotated Chinese corpus.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML).

Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming.

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1994. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330.

Bernard Merialdo. 1991. Tagging English text with a probabilistic model. In ICASSP, pages 809–812.

L. R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. In IEEE.

Andrew Radford. 1988. Transformational Grammar. Cambridge University Press, Cambridge.

Hinrich Schütze. 1995. Distributional part-of-speech tagging. In EACL.

Noah Smith and Jason Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Meeting of the ACL.

Huihsin Tseng, Daniel Jurafsky, and Christopher Manning. 2005. Morphological features help POS tagging of unknown words across language varieties. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing.
