File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/94/c94-2108_metho.xml
Size: 22,065 bytes
Last Modified: 2025-10-06 14:13:41
<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2108"> <Title>CONTENT CHARACTERIZATION USING WORD SHAPE TOKENS</Title> <Section position="4" start_page="0" end_page="686" type="metho"> <SectionTitle> 2 WORD SHAPE TOKEN CREATION </SectionTitle> <Paragraph position="0"> In this section we briefly describe our system that constructs character shape codes and word shape tokens from a document linage (for more detail, see Nakayama and Spitz, 1993; Sibun and Spitz, forthcoming). To recognize character shape codex from an image, SOnle transfornlatitnls alc first nlade \[o correct for various scanning artifacts such as skew angle and text line cnrvature. On each text line, four horizontal lines define three significant zones: the area between the baseline and the top of characters such as &quot;x&quot; is the x cone; the area above the x-height level is the ascender,~one; the area below the x-zone is the descender zone (figure 1). Tim text line is furthcr divided into charactercells by vertical bonnda,ics which delineate the connected components of each character image. ~ The majority of characters can easily be mapped to a small numher of distinct ccMes (\[igure 2). 1 Cllaracters which are contained entirely in the x-zone map to shape code x ; characters which extend \[rom the baseline to alxwe the x-height line map to shape code A: and those which extend from below the baseline to the xqmight line map to shape code g. Characters which map to A, x, or g are composed o1 a single connected component. Some characters con|ain Fnorc than one connected component: an x-height character with a single diacritical mark in the ascender zone maps to i ; a character with a descender and a single diacritical mark maps to j. Most common punctuation marks map to unique shape codes; however, I If this nmppmg can bc done from docmncnt images, it can more trivially bc aCCOlnplished frmn character coded docmncnts, sllch as .,\St '.\[I text (providing, of course, that lhc method of encoding is known).</Paragraph> <Paragraph position="1"> some are mapped into shape codes shared with alphabetic characters (e.g., &quot;&&quot; maps to shape code A).</Paragraph> </Section> <Section position="5" start_page="686" end_page="686" type="metho"> <SectionTitle> 3 SttAPE CONVERSION </SectionTitle> <Paragraph position="0"> In general, our approach to docmnent processing finesses the problems iltllerent in mapping from an imagc to a character coded representation: we nlap instead frollt the imagc to a shal)e basedr~Tn'esentalion. This technique can transform evell a degraded document tillage itlto a representation which provides useful abstractions about the text of a document. The shape-based representation that we construct is proving to be a relnarkably rich source o1 information. While our initial goal has beell to, use it lor language identification in support of downstreanl OCR pr(x;esses, we are finding that this representation lnay be a sufficient source of information for document content characterization, such as that supported by part-of-speech lagging.</Paragraph> <Paragraph position="1"> In our tagging work, we have used character shape c~xtedtext derived froth normal character-c{~,led text. This is simply because we dc, tlOt have access to enough inlage documents on which to train a taggef. 
<Paragraph position="2"> We call the process of creating a shape-based version of the document from the character-code-based version shape conversion. For the purpose of text tagging, then, we can think of the word shape token representation as an approximation of the representation composed of words. We can think about the relationship between words and word shape tokens as a mapping from a word to its corresponding word shape token. For example, the word "apple" maps to the word shape token xggAx, and the word "apples" maps to the word shape token xggAxx.</Paragraph> <Paragraph position="3"> In documents, words exist as surface forms, not as morphological systems; thus "apple" and "apples" are different words. Therefore, it is of no use to us to have a lexicon organized in terms of stems and suffixes; instead, our lexicon is composed of surface forms like "apple" and "apples". Throughout the rest of this paper, when we say "words", we mean words as surface forms.</Paragraph> </Section>

<Section position="6" start_page="686" end_page="686" type="metho"> <SectionTitle> 4 PART-OF-SPEECH TAGGING </SectionTitle> <Paragraph position="0"> A part-of-speech tagger is a system that uses context to assign parts of speech to words. Part-of-speech information facilitates higher-level analysis, such as recognizing noun phrases and other patterns in text. Several different approaches have been used for building text taggers. A particular form of Markov model has been widely used that assumes that a word depends probabilistically on just its part-of-speech category, which in turn depends solely on the categories of the preceding two words. Training the model is sometimes done by means of a large tagged corpus, but this is not necessary.</Paragraph> <Paragraph position="1"> The Baum-Welch algorithm (Baum, 1972), also known as the forward-backward algorithm, can be used. In this case, the model is called a hidden Markov model (HMM), since state transitions (i.e., part-of-speech categories) are assumed to be unobservable.</Paragraph> <Paragraph position="2"> For this work, we use an HMM-based text tagger that is publicly available from Xerox PARC. As described in Cutting et al. (1992), the PARC tagger is efficient and highly flexible. It is particularly important that the tagger can be trained on any corpus of text, using any lexicon.</Paragraph> <Paragraph position="3"> This flexibility allows us to shape-convert our training corpus and lexicon, as described in section 5, without needing to modify the tagger itself. Below we outline the basic operation of the PARC tagger; please refer to Cutting et al. (1992) for further detail.</Paragraph> <Paragraph position="4"> 1. Text destined for the tagger first encounters a tokenizer, whose duty is to convert text (a sequence of characters) into a sequence of tokens. Each sentence boundary is also identified by the tokenizer, and is passed as a special token.</Paragraph> <Paragraph position="5"> 2. The tokenizer passes tokens to the lexicon, where tokens are matched with a set of surface forms, each annotated with a part-of-speech tag. The set of tags constitutes an ambiguity class. The lexicon passes along a stream of (surface form, ambiguity class) pairs.</Paragraph> <Paragraph position="6"> 3a. In training mode, the tagger takes long sequences of ambiguity classes as input.
It uses the Baum-Welch algorithm to produce a trained HMM, which is used as input in tagging mode. Training is performed on some corpus of interest; this corpus may be of broad coverage or may be genre-specific.</Paragraph> <Paragraph position="7"> 3b. In tagging mode, the tagger buffers sequences of ambiguity classes between sentence boundaries. These sequences are disambiguated by computing the maximal path through the HMM with the Viterbi algorithm (Viterbi, 1967).</Paragraph> <Paragraph position="8"> Operating at sentence granularity does not sacrifice accuracy, since sentence boundaries are unambiguous.</Paragraph> <Paragraph position="9"> Output consists of pairs of surface forms and tags.</Paragraph> </Section>

<Section position="7" start_page="686" end_page="687" type="metho"> <SectionTitle> 5 THE LEXICON </SectionTitle> <Paragraph position="0"> The word shape tagging in our work follows the HMM-based process described above. Both word shape tagging and standard word tagging require a lexicon.</Paragraph> <Section position="1" start_page="686" end_page="687" type="sub_section"> <SectionTitle> 5.1 Constructing the Lexicon </SectionTitle> <Paragraph position="0"> A word shape lexicon can be derived from a standard lexicon of words. The lexicon used with the standard text tagger contains a list of all the distinct surface forms likely to be encountered in the language. Associated with each surface form is a list of the possible parts of speech that the surface form can have. For example:

apple    noun
apples   plural noun
eat      verb
eats     third person singular verb
tan      noun, adjective
the      determiner

Once we have a lexicon which consists of surface forms, we can use it to create a lexicon of word shape tokens for word shape tagging. In particular, this transformation consists of the following steps:

1. Shape-convert the surface forms to their corresponding word shape tokens.
2. Sort the lexicon by surface-form word shape. At this stage there may be duplicate word shape tokens.
3. Eliminate duplicate entries in the lexicon: collect all parts of speech behind one word shape token (combine their ambiguity classes). At this stage each word shape token should be unique.
4. Eliminate duplicate parts of speech behind each word shape token. At this stage each part of speech should be unique within each ambiguity class.

The lexicon fragment above would be converted to:

xggAx    noun
xggAxx   plural noun
xxA      verb
xxAx     third person singular verb
Axx      noun, adjective
AAx      determiner
</Paragraph> </Section> <Section position="2" start_page="687" end_page="687" type="sub_section"> <SectionTitle> 5.2 Analysis of the Lexicon </SectionTitle> <Paragraph position="0"> For this work, we use a lexicon provided by Xerox PARC. This lexicon is organized so that there is an entry for each of roughly 150,000 surface forms. For word shape tagging, we shape-converted this lexicon. As can be seen in the table, shape conversion results in about 50,000 distinct word shape surface forms. This suggests that, on average, each word shape token is a mapping of three surface forms. However, about 30,000 of the word shape tokens are unique, that is, correspond to a single surface form. Thus, the word shape lexicon is approximately one-third the size of the standard lexicon. Clearly, information has been lost, but not as much as one might think. In fact, the 20% of surface forms whose word shape tokens are unique carry exactly as much information as their corresponding character-coded words.</Paragraph>
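The transformation of section 5.1, together with the compression just described, can be illustrated in a short sketch. This assumes the word_shape function from the earlier sketch and a plain dict representation of the lexicon, which is our own simplification rather than the PARC lexicon format.

```python
from collections import Counter

def shape_convert_lexicon(word_lexicon: dict[str, list[str]]) -> dict[str, list[str]]:
    """Steps 1-4 of section 5.1: shape-convert every surface form, merge the
    ambiguity classes of forms sharing a word shape token, and keep each
    part of speech at most once per merged class."""
    shaped: dict[str, set[str]] = {}
    for surface, tags in word_lexicon.items():
        shaped.setdefault(word_shape(surface), set()).update(tags)
    return {token: sorted(tags) for token, tags in shaped.items()}

word_lexicon = {
    "apple": ["noun"],
    "apples": ["plural noun"],
    "eat": ["verb"],
    "eats": ["third person singular verb"],
    "tan": ["noun", "adjective"],
    "the": ["determiner"],
}
shape_lexicon = shape_convert_lexicon(word_lexicon)
# shape_lexicon["xggAx"] == ["noun"], shape_lexicon["AAx"] == ["determiner"], ...

# The compression statistics of section 5.2, computed the same way:
counts = Counter(word_shape(w) for w in word_lexicon)
unique_tokens = sum(1 for n in counts.values() if n == 1)  # tokens with one surface form
```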
<Paragraph position="1"> While some surface forms that map to unique word shape tokens are long and infrequent (like "flibbertigibbet", AAiAAxxAigiAAxA), many are short, common words. While word shape tokens that are unique have the same parts of speech as their corresponding surface forms, the others will tend on average to have many more parts of speech than an average surface form. This depends somewhat on the tagset (see section 6). In general, word shape tokens frequently have as many as 10 to 15 parts of speech, whereas standard surface forms rarely have more than 4 or 5.</Paragraph> </Section> </Section>

<Section position="8" start_page="687" end_page="687" type="metho"> <SectionTitle> 6 DEVISING THE TAGSET </SectionTitle> <Paragraph position="0"> The tagset is implicit in the lexicon: it includes all parts of speech listed in any entry of the lexicon; it also includes a small set of tags for punctuation, such as comma, hyphen, and sentence boundary. Although the tagset is not explicitly defined, we can modify it by mapping from selected tags found in the lexicon to other tags of our choosing. For example, the lexicon distinguishes between verb tenses and has separate tags for different combinations of verb tense, person, and number: present tense verb, past tense verb, third person singular present verb, etc. If we preferred, we could map all these different verb forms to a single verb tag. However, we typically prefer to maintain such distinctions, as the text tagger can take advantage of differences in the surface forms of verbs with different tenses in order to uniquely identify their parts of speech.</Paragraph> <Paragraph position="1"> Shape conversion collapses different surface forms onto one word shape and merges their ambiguity classes. The result is that there tend to be fewer distinct surface forms, and that each surface form has, on average, a larger ambiguity class. If this ambiguity is problematic, one way to reduce it may be to reduce the size of the tagset.</Paragraph> <Paragraph position="2"> For example, we may choose to have one undifferentiated verb tag rather than a set which differentiates tense, person, and number. With fewer possible parts of speech to choose from, the HMM may find the part-of-speech selection more constrained. This in turn may improve its accuracy at selecting one of the tags that are available.</Paragraph> <Paragraph position="3"> The uninteresting case, of course, is where every word shape has the same tag, that is, a tagset of one. This situation yields no useful syntactic information from the document. Since the use of word shape tokens does reduce the amount of information that is available to the tagger, it may reduce the number of different tags it can accurately assign. The proper size of the tagset becomes constrained on one hand by the amount of syntactic information we wish to extract (more information with a larger tagset) and on the other by the size of the ambiguity classes of the word shape tokens (more ambiguity with a larger tagset). Its proper size is thus an empirical question. For our tests we used tagsets with approximately 30 parts of speech.</Paragraph> </Section>
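Collapsing the tagset in this way amounts to rewriting each ambiguity class through a mapping and re-deduplicating it. A minimal sketch follows; the tag names are illustrative assumptions, not the actual PARC tagset.

```python
# Fold tense/person/number distinctions into one undifferentiated verb tag.
# The tag names below are assumed for illustration.
COLLAPSE = {
    "present tense verb": "verb",
    "past tense verb": "verb",
    "third person singular present verb": "verb",
}

def reduce_ambiguity_class(tags: list[str]) -> list[str]:
    """Map each tag through COLLAPSE and drop duplicates, shrinking the
    ambiguity class the HMM must choose from."""
    return sorted({COLLAPSE.get(t, t) for t in tags})

reduce_ambiguity_class(["past tense verb", "present tense verb", "noun"])
# -> ["noun", "verb"]
```

Applying reduce_ambiguity_class to every entry of the word shape lexicon yields the smaller tagset; whether the gain in constraint outweighs the loss of syntactic detail is, as the text notes, an empirical question.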
<Section position="9" start_page="687" end_page="688" type="metho"> <SectionTitle> 7 THE TRAINING PROCESS </SectionTitle> <Paragraph position="0"> Just as the hidden Markov model for standard text tagging requires a large corpus of text to train on, the word shape HMM requires a large corpus of text that has been converted to word shape tokens. We used at least 3.5 megabytes of ASCII text for our standard text tagger's corpus; we then shape-converted this text to create the corpus for the word shape tagger. This corpus consisted of a variety of different writing styles (from colloquial to professional) and difficulty levels (from casual to erudite).</Paragraph> <Paragraph position="1"> Examples include essays by humorists, proposals for new government policies, and classic works of literature.</Paragraph> </Section>

<Section position="10" start_page="688" end_page="688" type="metho"> <SectionTitle> 8 THE TAGGING PROCESS </SectionTitle> <Paragraph position="0"> With the word shape lexicon in place and an adequately trained HMM, word shape tagging works just as standard text tagging does. In particular, word shape tagging consists of the following steps:

1. A stream of text is tokenized into a stream of word shape tokens segmented into sentences.
2. The shape-converted lexicon assigns an ambiguity class to each word shape token. The result is a stream of sentences composed of (word shape token, ambiguity class) pairs.
3. The tagger uses the trained hidden Markov model to compute the highest-probability part of speech for each word shape token in a sentence. The result is a stream of (word shape token, part of speech) pairs, which are grouped according to sentence boundaries. We can then use the resulting parts of speech to inform other segments of a document understanding system.

The word shape part-of-speech tagger thus accepts word shape tokens grouped by sentence boundaries; within those boundaries, it assigns the most likely part of speech to each word shape token.</Paragraph> </Section>
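Step 3 is standard Viterbi decoding of an HMM, restricted at each position to the tags in the token's ambiguity class. A minimal sketch, assuming start, transition, and emission probability tables produced by Baum-Welch training (section 7) with no zero entries; this illustrates the computation rather than reproducing the PARC tagger's implementation.

```python
import math

def viterbi(sentence, start_p, trans_p, emit_p):
    """Most probable tag sequence for one sentence.
    sentence: list of (word shape token, ambiguity class) pairs;
    start_p[tag], trans_p[prev][tag], emit_p[tag][token] are probability
    tables assumed to come from Baum-Welch training (smoothed, nonzero)."""
    token, tags = sentence[0]
    # best[tag] = (log probability, tag path) of the best way to end in tag
    best = {t: (math.log(start_p[t] * emit_p[t][token]), [t]) for t in tags}
    for token, tags in sentence[1:]:
        best = {
            t: max(
                (lp + math.log(trans_p[prev][t] * emit_p[t][token]), path + [t])
                for prev, (lp, path) in best.items()
            )
            for t in tags
        }
    return max(best.values())[1]  # the maximal path through the HMM
```

For word shape tagging, the only change from standard tagging is that the emission table is indexed by word shape tokens rather than by words.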
<Section position="11" start_page="688" end_page="689" type="metho"> <SectionTitle> 9 RESULTS </SectionTitle> <Paragraph position="0"> In this section, we introduce a tool which can recognize noun phrases in sentences, and we use this tool to compare the performance of the standard tagger and the word shape tagger. We exemplify the comparison with two texts: one on which the standard tagger performs very well, and one on which it does relatively poorly. While the word shape tagger does less well in each case, its behavior tracks that of the standard tagger, exhibiting similar successes and failures. For the particular task of finding simple noun phrases, the word shape tagger's performance is less than the standard tagger's, but a large fraction of the noun phrases are still found.</Paragraph> <Paragraph position="1"> We have a system that can recognize simple noun phrases when given as input the sequence of tags for a sentence. Each of these phrases comprises a contiguous sequence of tags that satisfies a simple grammar. For example, a noun phrase can be simply a pronoun tag, or an arbitrary sequence of noun and adjective tags, possibly preceded by a determiner tag and possibly with an embedded possessive tag.[2] The longest possible such sequences are found. Conjunctions are not recognized as part of a noun phrase, nor is prepositional phrase attachment performed. We can be confident of finding many simple noun phrases because the word "the" has the unique word shape AAx.[3] Recognition of noun phrases is a first step in topic identification: the topic of a document is likely to be indicated by its most frequent noun phrases.</Paragraph> <Paragraph position="2"> [2] The possessive tag is used for "'s" or "'", as in "the cat's pajamas' stripes".</Paragraph> <Paragraph position="3"> [3] Another English word, "flu", also maps to AAx; fortunately, in most contexts this word is rare.</Paragraph>
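The grammar just described can be approximated by a greedy regular expression over per-token tag codes. A minimal sketch; the one-letter abbreviations are our own encoding, not the authors' tagset.

```python
import re

# One-letter codes for the tags the grammar mentions (our encoding):
# P pronoun, D determiner, N noun, J adjective, S possessive.
ABBREV = {"pronoun": "P", "determiner": "D", "noun": "N",
          "adjective": "J", "possessive": "S"}

# A simple noun phrase is a lone pronoun, or an optional determiner followed
# by nouns/adjectives with possessives allowed in the middle. Greedy
# quantifiers return the longest possible sequences, as the text requires.
NP = re.compile(r"P|D?[NJ](?:[NJS]*[NJ])?")

def simple_noun_phrases(tags: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) spans of simple noun phrases in one tagged
    sentence; any other tag acts as a phrase boundary."""
    coded = "".join(ABBREV.get(t, "-") for t in tags)  # '-' never matches
    return [m.span() for m in NP.finditer(coded)]

# "the cat's pajamas' stripes" tagged D N S N S N -> one span covering all six
assert simple_noun_phrases(
    ["determiner", "noun", "possessive", "noun", "possessive", "noun"]
) == [(0, 6)]
```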
<Paragraph position="4"> In evaluating the tagger error rate, we use several measures (see tables). We calculate the percentage of total errors, the percentage of trivial errors, and the percentage of pernicious errors (there are a few errors that do not fall into either of the latter categories). Tagging "alarming" in "what the advocates are finding alarming" as a present participle rather than as an adjective is an example of a trivial error. Pernicious errors typically involve mistagging nouns as verbs or verbs as nouns (in English, there are many surface forms that can be either nominal or verbal). These latter errors cause problems in later processing, such as detecting simple noun phrases, since they may obscure noun phrases or introduce spurious ones.</Paragraph> <Paragraph position="5"> We compare the standard tagger and the word shape tagger by counting the matches in the streams of output tags. We do not demand strict matches, but instead allow the tags to belong to pertinent equivalence classes. For example, the standard tagger labels the noun "monitors" as a plural noun, and the word shape tagger labels xxxiAxxx simply as a noun. We consider this a match, since a noun and a plural noun are equally well recognized as part of a noun phrase. Almost all instances of mismatches result from the standard tagger being right and the word shape tagger being wrong. Very occasionally the situation is the reverse, but this is to be expected as within the normal range of probabilities. More interesting is the observation that almost every pernicious error made by the standard tagger is repeated by the word shape tagger. We take this as confirmation of the word shape tagger's ability to approximate the standard tagger's performance.</Paragraph> <Paragraph position="6"> The first comparison of tagger performance involves a roughly 300-word excerpt from a government document. The standard tagger's performance is better than 95% correct, or better than 97% if trivial errors are disregarded. The word shape tagger's performance is a 59% match of the standard tagger's (or 51% if only exact matches are considered). The noun phrase recognizer found 113 simple noun phrases in the standard tagger's output and 77 (68%) of these in the word shape tagger's output.</Paragraph> <Paragraph position="7"> The second comparison is of tagging a roughly 140-word piece of nonsense verse. The standard tagger's performance is 89% correct, or 94% disregarding trivial errors. The word shape tagger's performance is a 47% match (or 38% considering only exact matches). The noun phrase recognizer found 45 simple noun phrases in the standard tagger's output and 17 (38%) of these in the word shape tagger's output.</Paragraph> <Paragraph position="8"> Further study is needed to determine exactly how reliable word shape part-of-speech tagging and simple noun phrase recognition will be in finding the topic or topics in a document image. One means of improving this reliability may be our technique for grammatical function assignment, which uses only the output of the part-of-speech tagger and phrase recognizers (Sibun, 1991). However, we can already use part-of-speech tagging and simple noun phrase recognition as a tool for discerning something about the content of the document by discovering at least some of its noun phrases. Since our document recognition technology allows us to use word shape tokens to index directly into the document image, we can also identify parts of the image as promising candidates for OCR.</Paragraph> </Section> </Paper>