<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2095"> <Title>A Formalism for Universal Segmentation of Text</Title> <Section position="2" start_page="0" end_page="658" type="metho"> <SectionTitle> 1 The Framework for Segmentation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="656" type="sub_section"> <SectionTitle> 1.1 Overview </SectionTitle> <Paragraph position="0"> The framework revolves around the document representation chosen for Sulno, which is a layered structure, each layer being a view of the document at a given level of seglnentation.</Paragraph> <Paragraph position="1"> These layers are introduced by the author of the segmentation application as needed and are not imposed by Sulno. The example in section 3.1 uses a two-layer structure (figure 4) corresponding to two levels of segmentation, characters and words. To extend this to a sentence seglnenter, a third level for sentences is added.</Paragraph> <Paragraph position="2"> These levels of segmentation can have a lin- null g,fistic or structural level, but &quot;artificiar' levels can be introduced a.s well when needed. It is also interesting to note that several layers can belong to the same level. In the example of section 3.3, the result structure can have an indefinite number of levels, and all levels are of the same kind. We (:all item the segmentation unit o\['a docuntent at a given segmentation level (e.g. items of the word level are words). The document is then represented at every segmentation level in 1;erms of its items; I)ecause segmentation is usually ambiguous, item .qraph.~ are used to \['actorize all the possible segmc'l,ta.tions. Ambiguity issues are furthel' addressed in section 2.3.</Paragraph> <Paragraph position="3"> The main processing i)aradigms of Sumo are ident{/icatio'n and h'ansJbrmation,. With ideutifical;ion, new item graphs are built by identif'ying items fi'om a source graph using a segmentation resource, q'hese graphs are 1;hen modified l)y translbrula.tion processes. Section 2 gives the details al)out both identificatio~l and t\]'a.nsfof mation.</Paragraph> <Paragraph position="4"> 1.:2 Item Graphs .</Paragraph> <Paragraph position="5"> 'l'lle iten:l gral)hs are directed acyclic gral)hs; they are similar to the word graphs of (Amtru 1) et al., 11996) or the string graphs of (C'olmerauer, 1970). They are actually rel)resente(I 1)y means of finite-sta.te automata (see section 2.\]). IH order to facilitate their manilmlation , two a(1ditio~tal prol)erties are on forced: these m Jtom ata ahvays lm.ve a single start-state and finite-slate, and no dangling arcs (this is verified by pruning the automata after modifications). The exampies of section 3 show va.rio~ls iteln graphs.</Paragraph> <Paragraph position="6"> An item is an arc in the automato~l. An arc is a complex structure containing a label (generally the surface /brm of the item), named attributes and relations. Attributes are llsed to hold information on the item, like part of speech tags (see section 3.2). These attributes can also be viewed as annotations in the same sense as the annotation graphs of (Bird el; 3l., 2000).</Paragraph> </Section> <Section position="2" start_page="656" end_page="656" type="sub_section"> <SectionTitle> 1.3 Relations </SectionTitle> <Paragraph position="0"> Relations are links between levels. Items from a given graph are linked to items of the graph from which they were identified. 
<Section position="2" start_page="656" end_page="656" type="sub_section"> <SectionTitle> 1.3 Relations </SectionTitle>
<Paragraph position="0"> Relations are links between levels. Items from a given graph are linked to items of the graph from which they were identified. We call the first graph the lower graph and the graph that was the source for the identification the upper graph. Relations exist between a path in the upper graph and either a path or a subgraph in the lower graph.</Paragraph>
<Paragraph position="1"> Figure 1 illustrates the first kind of relation, called path relation. This example in French is a relation between the two characters of the word &quot;du&quot;, which is really a contraction of &quot;de le&quot;. Figure 2 illustrates the second kind of relation, called subgraph relation. In this example the sentence ABCDEFG (we can imagine that A through G are Chinese characters) is related to several possible segmentations.</Paragraph>
<Paragraph position="2"/>
<Paragraph position="3"> The interested reader may refer to (Planas, 1998) for a comparable structure (multiple layers of a document and relations) used in translation memory.</Paragraph> </Section>
<Section position="3" start_page="656" end_page="657" type="sub_section"> <SectionTitle> 2 Processing a Document 2.1 Description of a Document </SectionTitle>
<Paragraph position="0"> The core of the document representation is the item graph, which is represented by a finite-state automaton. Since regular expressions define finite-state automata, they can be used to describe an item graph. However, our expressions are extended because the items are more complex than simple symbols; new operators are introduced: * attributes are introduced by an @ sign; * path relations are delimited by { and }; * the information concerning a given item is parenthesized using [ and ].</Paragraph>
<Paragraph position="1"> As an example, the relation of figure 1 is described by the following expression: [ de le { d u } ]</Paragraph> </Section>
<Section position="4" start_page="657" end_page="657" type="sub_section"> <SectionTitle> 2.2 Identification </SectionTitle>
<Paragraph position="0"> Identification is the process of identifying new items from a source graph. Using the source graph and a segmentation resource, new items are built to form a new graph. A segmentation resource, or simply resource, describes the vocabulary of the language by defining a mapping between the source and the target level of segmentation. A resource is represented by a finite-state transducer in Sumo; identification is performed by applying the transducer to the source automaton to produce the target automaton, as in regular finite-state calculus.</Paragraph>
<Paragraph position="1"> Resources can be compiled from regular expressions or identification rules. In the former case, one can use the usual operations of finite-state calculus to compile the resource: union, intersection, composition, etc. A benefit of using Sumo structures to represent resources is that new resources can be built easily from the document being processed. (Quint, 1999) shows how to extract proper nouns from a text in order to extend the lexicon used by the segmenter and provide more accurate results.</Paragraph>
<Paragraph position="2"> In the latter case, rules are specified as shown in section 3.3. The left hand side of a rule describes a subpath in the source graph, while the right hand side describes the associated subpath in the target graph. A path relation is created between the two sequences of items. In an identification rule, one can introduce variables (for callback), and even calls to transformation functions (see next section). Naturally, these possibilities cannot be expressed by a strict finite-state structure, even with our extended formalism; hence, calculus with the resulting structures is limited.</Paragraph>
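To illustrate the common case of character-to-word identification, here is a rough Python sketch (illustrative only: it stands in for the finite-state transducer with a plain set of surface forms, and every name in it is an assumption rather than Sumo's API). It enumerates the word items that a dictionary resource would identify over a linear character graph.

    def identify_words(chars, lexicon, max_len=8):
        """chars: source items, one character each; lexicon: known surface forms.
        Returns word items as (start, end, surface) arcs between character nodes;
        a real implementation would also record the path relation of each word
        back to chars[start:end]."""
        words = []
        for i in range(len(chars)):
            for j in range(i + 1, min(i + max_len, len(chars)) + 1):
                surface = "".join(chars[i:j])
                if surface in lexicon:
                    words.append((i, j, surface))   # arc from node i to node j
        return words

    # With lexicon = {"A","AB","B","BC","BCDEF","C","CD","D","DE","E","F","FG","G"},
    # identify_words(list("ABCDEFG"), lexicon) yields the word graph of figure 4.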
<Paragraph position="3"> A special kind of identification is the automatic segmentation that takes place at the entry point of the process. A character graph can be created automatically by segmenting an input text document, knowing its encoding. This text document can be in raw form or XML format.</Paragraph>
<Paragraph position="4"> [truncated footnote] ...accommodate the more complex nature of the items.</Paragraph>
<Paragraph position="5"> [truncated footnote] ...of items that was created previously, either by Sumo, or converted to the format recognized by Sumo.</Paragraph> </Section>
<Section position="5" start_page="657" end_page="658" type="sub_section"> <SectionTitle> 2.3 Transformation </SectionTitle>
<Paragraph position="0"> Ambiguity is a central issue when talking about segmentation. The absence or ambiguity of word separators can lead to multiple segmentations, and more than one of them can have a meaning. As (Sproat et al., 1996) testify, several native Chinese speakers do not always agree on one unique tokenization for a given sentence.</Paragraph>
<Paragraph position="1"> Thanks to the use of item graphs, Sumo can handle ambiguity efficiently. Why try to fully disambiguate a tokenization when there is no agreement on a single best solution? Moreover, segmentation is usually just a basic step of processing in an NLP system, and some decisions may need more information than what a segmenter is able to provide. An uninformed choice at this stage can affect the next stages in a negative way. Transformations are a way to modify the item graphs so that the &quot;good&quot; paths (segmentations) can be kept and the &quot;bad&quot; ones discarded. We can of course also provide full disambiguation (see section 3.1 for instance) by means of transformations.</Paragraph>
<Paragraph position="2"> In Sumo, transformations are handled by transformation functions that manipulate the objects of the formalism: graphs, nodes, items, paths (a special kind of graph), etc. These functions are written using an imperative language illustrated in section 3.1. A transformation can either be applied directly to a graph or attached to a graph relation. In the latter case, the original graph is not modified, and its transformed counterpart is only accessible through the relation.</Paragraph>
<Paragraph position="3"> Transformation functions allow control over the flow of the process, using loops and conditionals. An important implication is that the same resource can be applied iteratively; as shown by (Roche, 1994), this feature makes it possible to implement segmentation models much more powerful than simple regular languages (see section 3.3 for an example). Another consequence is that a Sumo application consists of one big transformation function returning the completed Sumo structure as a result.</Paragraph> </Section> </Section>
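As a rough sketch of this paradigm (Python again, reusing the illustrative ItemGraph above; none of this is Sumo's actual syntax, whose code appears in figures 3, 5 and 6), a transformation is just a function from graphs to graphs, and control flow is ordinary looping.

    def keep_path(graph, path):
        """A disambiguating transformation: discard every item not on the
        chosen path, then prune the dangling arcs."""
        chosen = set(map(id, path))
        for n in list(graph.out):
            graph.out[n] = [i for i in graph.out[n] if id(i) in chosen]
        for n in list(graph.inc):
            graph.inc[n] = [i for i in graph.inc[n] if id(i) in chosen]
        graph.prune()
        return graph

    def apply_until_stable(transform, graph):
        """Iterated application of the same step, the feature that takes
        segmentation models beyond simple regular languages (section 3.3).
        transform is assumed to return True as long as it modifies the graph."""
        while transform(graph):
            pass
        return graph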
<Section position="3" start_page="658" end_page="658" type="metho"> <SectionTitle> 3 Examples of Use </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="658" end_page="658" type="sub_section"> <SectionTitle> 3.1 Maximum tokenization </SectionTitle>
<Paragraph position="0"> Some classic heuristics for tokenization are classified by (Guo, 1997) under the collective moniker of maximum tokenization. This section describes how to implement a &quot;maximum tokenizer&quot; that tokenizes raw text documents in a given language and character encoding (e.g. English in ASCII, French in Iso-Latin-1, Chinese in Big5 or GB).</Paragraph>
<Paragraph position="1"> Our tokenizer is built with two levels: the input level is the character level, automatically segmented using the encoding information. The token level is built from these characters, first by an exhaustive identification of the tokens, then by reducing the number of paths to the one considered the best by the Maximum Tokenization heuristic.</Paragraph>
<Paragraph position="2"> The system works in three steps, with complete code shown in figure 3. First, the character level is created by automatic segmentation (lines 1-5, input level being the special graph that is automatically created from a raw file through stdin). The second step is to create the word graph by identifying words from characters using a dictionary. A resource called ABCdic is created from a transducer file (lines 6-8), then the graph words is created by identifying items from the source level characters using the resource ABCdic (lines 9-12). The third step is the disambiguation of the word level by applying a Maximum Tokenization heuristic (line 13). Figure 4 illustrates the situation for the input string &quot;ABCDEFG&quot;, where A through G are characters and A, AB, B, BC, BCDEF, C, CD, D, DE, E, F, FG and G are words found in the resource ABCdic. The situation shown is the one after the second step.</Paragraph> </Section> </Section>
<Section position="4" start_page="658" end_page="661" type="metho"> <SectionTitle> [Figure 4: character and word graphs for the input string ABCDEFG] </SectionTitle>
<Paragraph position="0"> We will see in the next three subsections the different heuristics and their implementations in Sumo.</Paragraph>
<Paragraph position="1"> 3.1.2 Forward Maximum Tokenization. Forward Maximum Tokenization consists of scanning the string from left to right and selecting the token of maximum length any time an ambiguity occurs. On the example of figure 4, the resulting tokenization of the input string would be AB/CD/E/FG.</Paragraph>
<Paragraph position="2"> Figure 5 shows a function called ft that builds a path recursively by traversing the token graph, appending the longest item to the path at each node. ft takes a node as input and returns a path (line 1). If the node is final, the empty path is returned (lines 2-3); otherwise the array of items of the node (n.items) is searched and the longest item stored in longest (lines 4-10). The returned path consists of this longest item prepended to the longest path starting from the destination node of this item (line 11).</Paragraph>
<Paragraph position="3"> 3.1.3 Backward Maximum Tokenization. Backward Maximum Tokenization is the same as Forward Maximum Tokenization except that the string is scanned from right to left instead of left to right. On the example of figure 4, the tokenization of the input string would yield A/BC/DE/FG under Backward Maximum Tokenization.</Paragraph>
<Paragraph position="4"> A function bt can be written that is very similar to ft, except that it works backward by looking at incoming arcs of the considered node. bt is called on the final state of the graph and stops when at the initial node. Another implementation of this function is to apply ft on the reversed graph and then reverse the path obtained.</Paragraph>
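Since figure 5 is not reproduced here, the following Python sketch renders the recursion just described (illustrative only; n.items is the array named in the text, while is_final, label and dst are assumed field names standing in for Sumo's node and item operations).

    def ft(n):
        # Forward Maximum Tokenization: greedy longest match from node n.
        if n.is_final:                       # final node: return the empty path
            return []
        longest = n.items[0]                 # search the outgoing items for
        for item in n.items[1:]:             # the one with the longest label
            if len(item.label) > len(longest.label):
                longest = item
        return [longest] + ft(longest.dst)   # prepend it and recurse

    # bt is the mirror image: either walk the incoming arcs backward from the
    # final node, or run ft on the reversed graph and reverse the path.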
<Paragraph position="5"> 3.1.4 Shortest Tokenization. Shortest Tokenization is concerned with minimizing the overall number of tokens in the text. On the example of figure 4, the tokenization of the input string would yield A/BCDEF/G under Shortest Tokenization.</Paragraph>
<Paragraph position="6"> Figure 6 shows a function called st that finds the shortest path in the graph. This function is adapted from an algorithm for single-source shortest-path discovery in a DAG given by (Cormen et al., 1990). It calls another function, t_sort, returning a list of the nodes of the graph in topological order. The initializations are done in lines 2-6; the core of the algorithm is the loop of lines 7-14, which computes the shortest path to every node, storing for each node its &quot;predecessor&quot;. Lines 15-20 then build the path, which is returned in line 21.</Paragraph>
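Again in place of the figure, a Python sketch of st (illustrative; t_sort is the helper named in the description, here given a standard depth-first implementation, and the graph fields are the same assumed ones as before). Every arc counts for one token, so the shortest path is the one with the fewest items.

    def t_sort(graph):
        """Topological order of the DAG's nodes, by depth-first search."""
        seen, order = set(), []
        def visit(n):
            if n not in seen:
                seen.add(n)
                for item in graph.out[n]:
                    visit(item.dst)
                order.append(n)
        visit(graph.start)
        order.reverse()
        return order

    def st(graph):
        order = t_sort(graph)
        dist = {n: float("inf") for n in order}    # initializations
        pred = {n: None for n in order}
        dist[graph.start] = 0
        for n in order:                  # relax every arc in topological order
            for item in graph.out[n]:
                if dist[n] + 1 < dist[item.dst]:   # every item costs 1
                    dist[item.dst] = dist[n] + 1
                    pred[item.dst] = item          # remember the "predecessor"
        path, n = [], graph.final        # walk the predecessors back to start
        while pred[n] is not None:
            path.insert(0, pred[n])
            n = pred[n].src
        return path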
<Section position="1" start_page="659" end_page="659" type="sub_section"> <SectionTitle> 3.1.5 Combination of Maximum Tokenization techniques </SectionTitle>
<Paragraph position="0"> One of the features of Sumo is to allow the comparison of different segmentation strategies using the same set of data. As we have just seen, the three strategies described above can indeed be compared efficiently by modifying only part of the third step of the processing. Letting the system run three times on the same set of input documents can then give three different sets of results to be compared by the author of the system (against each other and against a reference tokenization, for instance).</Paragraph>
<Paragraph position="1"> Yet a different set-up for our &quot;maximum tokenizer&quot; would be to select not just the optimal path according to one of the heuristics, but the paths selected by all three of them, as shown in figure 7. Combining the three paths into a graph is performed by changing line 13 in figure 3 to:</Paragraph>
<Paragraph position="3"/> </Section>
<Section position="2" start_page="659" end_page="660" type="sub_section"> <SectionTitle> 3.2 Statistical Tokenization and Part of Speech Tagging </SectionTitle>
<Paragraph position="0"> This example shows a more complicated tokenization system, using the same sort of set-up as the one from section 3.1, with a disambiguation process using statistics (namely, a bigram model). Our reference for this model is the ChaSen Japanese tokenizer and part of speech tagger documented in (Matsumoto et al., 1999). This example is a high-level description of how to implement a similar system with Sumo.</Paragraph>
<Paragraph position="1"> The set-up for this example adds a new level to the previous example: the &quot;bigram level&quot;. The word level is still built by identification using dictionaries, then the bigram level is built by computing a connectivity cost between each pair of tokens. This is the level that will be used for disambiguation or selection of the best solutions.</Paragraph>
<Paragraph position="2"> All possible segmentations are derived from the character level to create the word level. The resource used for this is a dictionary of the language that maps the surface form of the words (in terms of their characters) to their base form, part of speech, and a cost (ChaSen also adds pronunciation, conjugation type, and semantic information). All this information is stored in the item as attributes, the base form being used as the label for the item. Figure 8 shows the identification of the word &quot;cats&quot;, which is identified as &quot;cat&quot;, with category &quot;noun&quot; (i.e. @CAT=N) and with some cost k (@COST=k).</Paragraph>
<Paragraph position="3"> The disambiguation method relies on a bigram model: each pair of successive items has a &quot;connectivity cost&quot;. In the bigram level, the &quot;cost&quot; attribute of an item W will be the connectivity cost of W and a following item X. Note that if the same W can be followed by several items X, Y, etc. with different connectivity costs for each pair, then W will be replicated with a different &quot;cost&quot; attribute. Figure 9 shows a word W followed by either X or Y, with two different connectivity costs h and h'.</Paragraph>
<Paragraph position="4"> The implementation of this technique in Sumo is straightforward. Assume there is a function f that, given two items, computes their connectivity cost (depending on their categories, individual costs, etc.) and returns the first item with its modified cost. We write the following rule and apply it to the word graph to create the bigram graph:</Paragraph>
<Paragraph position="5"/>
<Paragraph position="6"> This rule can be read as: for any word $w1 with any attribute (&quot;.&quot; matches any label, &quot;@.&quot; any set of attributes) followed by any word $w2 with any attribute (&quot;_&quot; being a context separator), create the item returned by the function f($w1, $w2).</Paragraph>
<Paragraph position="7"> Disambiguation is then performed by selecting the path with optimal cost in this graph; but we can also select all paths with a cost below a certain threshold, or the n best paths, etc. Note also that this model is easily extensible to any kind of n-grams. A new function f($w1, ..., $wn) must be provided to compute the connectivity cost of this sequence of items, and the above rule must be modified to take a larger context into account.</Paragraph> </Section>
<Section position="3" start_page="660" end_page="661" type="sub_section"> <SectionTitle> 3.3 A Formal Example </SectionTitle>
<Paragraph position="0"> This last example is more formal and serves as an illustration of some powerful features of Sumo. (Colmerauer, 1970) has a similar example implemented using Q systems. In both cases the goal is to transform an input string of the form a^n b^n c^n, n &gt; 0, into a single item S (assuming that the input alphabet does not contain S), meaning that the input string is a word of this language.</Paragraph>
<Paragraph position="1"> The set-up here is once again to start with a lower level automatically created from the input, then to build intermediate levels until a final level containing only the item S is produced (at which point the input is recognized), or until the process can no longer carry on (at which point the input is rejected).</Paragraph>
<Paragraph position="2"> The building of intermediate levels is handled by the identification rule below. What this rule does is identify a string of the form S?aa*bb*cc*, storing all a's but the first one in the variable $A, all b's but the first one in $B and all c's but the first one in $C. The first triplet abc (with a possible S in front) is then absorbed by S, and the remaining a's, b's and c's are rewritten after S.</Paragraph>
<Paragraph position="3"> Figure 10 illustrates the first application of this rule to the input sequence aabbcc, creating the first intermediate level; subsequent applications of this rule will yield the only item S. [Figure 10: First application of the rule]</Paragraph>
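The effect of this rule can be simulated on plain strings in a few lines of Python (an illustrative reconstruction from the description above, not Sumo code): each application absorbs one abc triplet into S, and the input is accepted exactly when a single S remains.

    import re

    # S? a a* b b* c c*  ->  S followed by the leftover a's, b's and c's
    RULE = re.compile(r"^S?a(a*)b(b*)c(c*)$")

    def step(s):
        m = RULE.match(s)
        return "S" + m.group(1) + m.group(2) + m.group(3) if m else None

    def accepts(s):
        while s != "S":
            s = step(s)
            if s is None:          # the rule no longer applies: reject
                return False
        return True                # only the item S remains: recognized

    # accepts("aabbcc") -> True   (aabbcc -> Sabc -> S, as in figure 10)
    # accepts("aabbc")  -> False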
<Paragraph position="4"> Conclusion. We have described the main features of Sumo, a dedicated formalism for the segmentation of text. A document is represented by item graphs at different levels of segmentation, which allows multiple segmentations of the same document at the same time. Three detailed examples illustrated the features of Sumo discussed here. For the sake of simplicity, some aspects could not be covered in this paper; they include the management of segmentation resources, the efficiency of the systems written in Sumo, larger applications, and the evaluation of segmentation systems. Sumo is currently being prototyped by the author.</Paragraph> </Section> </Section> </Paper>