Reducing Parsing Complexity by Intra-Sentence Segmentation based on Maximum Entropy Model

2 Maximum Entropy Modeling

Sentence patterns or pattern rules specify the sub-structures of sentences; that is, segmentation positions are determined in view of the global sentence structure. If no rule or pattern matches a given sentence, the sentence cannot be segmented.

We assume that whether a word is a segmentation position depends on its surrounding context, and we try to find the factors that affect the determination of segmentation positions. Maximum entropy is a technique for automatically acquiring knowledge from incomplete information without making any unsubstantiated assumptions. It captures subtle effects, so that we can accurately model subtle dependencies. Because it makes no unwarranted assumptions, a maximum entropy model learns exactly what the data say and can therefore perform well on unseen data.

The idea is to construct a model that assigns a probability to each potential segmentation position in a sentence. We build a probability distribution p(y|x), where y ∈ {0, 1} is a random variable indicating whether the position in context x is a segmentation position. A feature of a context is a binary-valued indicator function f expressing information about a specific context.

Given a training sample of size N, (x_1, y_1), ..., (x_N, y_N), an empirical probability distribution can be defined as

    \tilde{p}(x, y) = \frac{1}{N} \#(x, y),

where #(x, y) is the number of occurrences of (x, y). The expected value of a feature f_i with respect to the empirical distribution \tilde{p}(x, y) is

    \tilde{p}(f_i) = \sum_{x, y} \tilde{p}(x, y) f_i(x, y),

and the expected value of f_i with respect to the model distribution p(y|x) is

    p(f_i) = \sum_{x, y} \tilde{p}(x) p(y \mid x) f_i(x, y),

where \tilde{p}(x) is the empirical distribution of x in the corpus. We want to build a probability distribution p(y|x) that accords with every feature f_i useful in selecting segmentation positions:

    p(f_i) = \tilde{p}(f_i) \quad \text{for all } f_i \in \mathcal{F},

where \mathcal{F} is the set of candidate features. This constrains the probability distribution to be built only on the training data.

Given a feature set \mathcal{F}, let C be the subset of all distributions P that satisfy the requirement:

    C = \{ p \in P \mid p(f_i) = \tilde{p}(f_i) \text{ for all } f_i \in \mathcal{F} \}. \quad (1)

We choose a probability distribution consistent with all the facts, but otherwise as uniform as possible. The uniformity of the probability distribution p(y|x) is measured by the conditional entropy

    H(p) = - \sum_{x, y} \tilde{p}(x) p(y \mid x) \log p(y \mid x).

Thus, the probability distribution with maximum entropy is the most uniform distribution.

In building a model, we consider the linear exponential family Q given as

    p_\Lambda(y \mid x) = \frac{1}{Z_\Lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big), \quad (2)

where the \lambda_i are real-valued parameters and Z_\Lambda(x) is a normalizing constant:

    Z_\Lambda(x) = \sum_y \exp\Big( \sum_i \lambda_i f_i(x, y) \Big).

The intersection of the class Q of exponential models with the class of desired distributions (1) is nonempty; it contains the maximum entropy distribution, and furthermore that distribution is unique (Ratnaparkhi, 1994).
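As an illustration of the exponential form (2), the following Python sketch computes p_Λ(y | x) for a single context; the dictionary-based context encoding, the example feature, and the weight value are hypothetical stand-ins rather than the features actually used in this work.

import math

def p_lambda(y, x, features, lambdas):
    # Conditional exponential model of equation (2):
    #   p_Lambda(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z_Lambda(x)
    def unnormalized(label):
        return math.exp(sum(lam * f(x, label) for f, lam in zip(features, lambdas)))
    z = unnormalized(0) + unnormalized(1)   # Z_Lambda(x), summing over y in {0, 1}
    return unnormalized(y) / z

# Hypothetical feature: fires when the current word is "when" and y = 1.
features = [lambda x, y: 1 if x.get("word") == "when" and y == 1 else 0]
lambdas = [1.5]

print(p_lambda(1, {"word": "when"}, features, lambdas))   # probability of segmenting here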
Finding the p_* ∈ C that maximizes H(p) is a constrained optimization problem whose solution cannot, in general, be written explicitly. We therefore take advantage of the fact that the models in Q satisfying p(f_i) = \tilde{p}(f_i) can be characterized under the maximum likelihood framework (Ratnaparkhi, 1994). The maximum likelihood principle also yields the unique distribution p_*, which lies in the intersection of the class Q with C.

We assume that each occurrence of (x, y) is sampled independently. Thus, the log-likelihood L_{\tilde{p}}(p) of the empirical distribution \tilde{p} as predicted by a model p can be defined as

    L_{\tilde{p}}(p) = \sum_{x, y} \tilde{p}(x, y) \log p(y \mid x).

That is, the model we want to build is

    p_* = \arg\max_{p \in C} H(p) = \arg\max_{q \in Q} L_{\tilde{p}}(q).

The parameters \lambda_i of the exponential model (2) are obtained by the Generalized Iterative Scaling algorithm (Darroch and Ratcliff, 1972).

3 Construction of Features

This section describes the features. From a corpus, contextual evidence of segmentation positions is collected and combined, resulting in features. The features are used in identifying potential segmentation positions and are included in the model.

3.1 Segmentable Positions and Safe Segmentation

A sentence is constructed by combining words, phrases, and clauses under a well-defined grammar. A sentence can be segmented into shorter segments that correspond to the constituents of the sentence; that is, segments correspond to the nonterminal symbols of the context-free grammar.¹ The position of a word is called a segmentable position if it can be the starting position of a specific segment.

¹ Nonterminal symbols include those for phrases, such as NP (noun phrase) and VP (verb phrase), and those for clauses, such as RLCL (relative clause) and SUBCL (subordinate clause).

Though the analysis complexity can be reduced by segmenting a sentence, there is a risk that mis-segmentation causes parsing failures. A segmentation that results in coherent blocks of words is called a safe segmentation. In English-Korean translation, a safe segmentation is defined as one that generates safe segments. A segment is safe when there is a syntactic category symbol dominating the segment and the segment can be combined with adjacent segments under a given grammar. In Figure 1, (a) is an unsafe segmentation because the second segment cannot be analyzed into one syntactic category, resulting in a parsing failure. With the safe segmentation (b), the first segment corresponds to a noun phrase and the second to a verb phrase, so that we obtain a correct analysis result.

(a) [The students] [who study hard will pass the exam]
(b) [The students who study hard] [will pass the exam]

Figure 1: An unsafe segmentation (a) and a safe segmentation (b).

3.2 Lexical Contextual Constraints

The lexical context of a word includes a seven-word window: the word itself, the three words to its left, and the three words to its right. It also includes the parts of speech of these words, subcategorization information for the two words to the left, and a position value. The position value posi_v of the i-th word w_i is calculated as

    posi\_v = \left\lceil \frac{i}{n} \times R \right\rceil,

where n is the number of words and R² represents the number of regions in the sentence. A region is a sequentially ordered block of words in a sentence, and posi_v represents the region in which a word lies. It is included to reflect the influence of a word's position on its being a segmentation position. Thus, the lexical context of a word is represented by 17 attributes, as shown in Figure 2.

² It is a heuristically set value, and we set R to 4.
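A minimal sketch of how the 17 attributes of a lexical context might be assembled is given below, assuming the ceiling reading of the position-value formula; the function names, the padding token, and the way part-of-speech tags and subcategorization values are supplied are illustrative assumptions, not the actual data format used here.

import math

R = 4  # number of regions per sentence; heuristically set to 4 (see footnote 2)

def position_value(i, n, r=R):
    # posi_v = ceil((i / n) * R), with i the 1-based index of the word
    # and n the number of words in the sentence
    return math.ceil(i / n * r)

def lexical_context(words, pos_tags, subcat, i, pad="<none>"):
    """Build the 17-attribute lexical context of the i-th word (1-based):
    7 window words, their 7 part-of-speech tags, subcategorization values
    for the two words to the left, and the position value."""
    n = len(words)
    def w(j):   # word at 1-based position j, padded outside the sentence
        return words[j - 1] if 1 <= j <= n else pad
    def t(j):   # part-of-speech tag at position j
        return pos_tags[j - 1] if 1 <= j <= n else pad
    def s(j):   # subcategorization value at position j
        return subcat[j - 1] if 1 <= j <= n else 0

    window = [w(j) for j in range(i - 3, i + 4)]   # w_{i-3} ... w_{i+3}
    tags = [t(j) for j in range(i - 3, i + 4)]     # their part-of-speech tags
    subcats = [s(i - 2), s(i - 1)]                 # two words to the left
    return tuple(window + tags + subcats + [position_value(i, n)])  # 7 + 7 + 2 + 1 = 17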
An example of training data and the resulting lexical context is shown in Figure 3:

    Of course his parents became terribly worried #when they saw what was happening

Figure 3: An example of training data and a lexical context.

The symbol '#' marks a segmentation position annotated by human annotators. Therefore, the training instance for the word when includes the value 1 for the attribute s_position and the following lexical context: the three words to the left of when (became, terribly, and worried) and their parts of speech (VERB ADV ADJ), the three words to the right (they, saw, and what) and their parts of speech (PRON VERB PRON), subcategorization information for the two words to the left (0 1), and the position value (2).

To get reliable statistics, a large amount of training data is required. To alleviate this problem, we generate lexical contextual constraints by combining lexical contexts and collect statistics for them. To generate lexical contextual constraints and to identify segmentable positions, we define two operations: join (⊕) and consistency (≐). Let (a_1, ..., a_n) and (b_1, ..., b_n) be lexical contexts and (C_1, ..., C_n) be a lexical contextual constraint. The join operation is defined as

    (a_1, \ldots, a_n) \oplus (b_1, \ldots, b_n) = (C_1, \ldots, C_n), \quad C_i = \begin{cases} a_i & \text{if } a_i = b_i \\ * & \text{otherwise,} \end{cases}

where '*' is a don't-care term accepting any value. A lexical contextual constraint is generated as the result of a join operation. Consistency is defined as

    (a_1, \ldots, a_n) \doteq (C_1, \ldots, C_n) \iff C_i = a_i \text{ or } C_i = * \text{ for all } i.

The algorithm for generating lexical contextual constraints is shown in Figure 4.

Input: a set of active lexical contexts LC_w = {lc_1, ..., lc_n} for a word w, where lc_i = (a_1, ..., a_n).
Output: a set of lexical contextual constraints LCC_w = {lcc_1, ..., lcc_k}, where lcc_i = (C_1, ..., C_n).
1. Initialize LCC_w = ∅.
2. For each lc_i ∈ LC_w:
   (a) For all lc_j (j ≠ i), let Count(lc_j) be the number of attributes of lc_j matched with lc_i.
   (b) max_cnt = max_{lc_j ∈ LC_w} Count(lc_j).
   (c) For all lc_j with Count(lc_j) = max_cnt, add lc_i ⊕ lc_j to LCC_w.

Figure 4: The algorithm for generating lexical contextual constraints.

An lcc plays the role of a feature. We collect statistics for each lcc: the frequency of an lcc is counted as the number of lexical contexts that satisfy the consistency operation with it.

Identifying segmentable positions is performed by applying the consistency operation to the lexical context of a word w and each lcc ∈ LCC_w. A word whose lexical context is consistent with some lcc is identified as a segmentable position.
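The join and consistency operations, the constraint-generation algorithm of Figure 4, and the lcc frequency count could be sketched as follows; the '*' don't-care marker and the handling of ties in step (c) follow the reconstruction above and are assumptions where the original figure is not fully recoverable.

DONT_CARE = "*"   # don't-care term accepting any value (assumed marker)

def join(a, b):
    # Generalize two lexical contexts into a lexical contextual constraint:
    # keep an attribute where the contexts agree, otherwise use the don't-care term.
    return tuple(ai if ai == bi else DONT_CARE for ai, bi in zip(a, b))

def consistent(lc, lcc):
    # A lexical context is consistent with a constraint if every specified
    # (non-don't-care) attribute of the constraint matches the context.
    return all(ci == DONT_CARE or ci == ai for ai, ci in zip(lc, lcc))

def generate_constraints(contexts):
    # Figure 4: for each lexical context, join it with the context(s)
    # sharing the largest number of matched attributes.
    constraints = set()
    for i, lc_i in enumerate(contexts):
        counts = {j: sum(a == b for a, b in zip(lc_i, lc_j))
                  for j, lc_j in enumerate(contexts) if j != i}
        if not counts:
            continue
        max_cnt = max(counts.values())
        for j, cnt in counts.items():
            if cnt == max_cnt:
                constraints.add(join(lc_i, contexts[j]))
    return constraints

def lcc_frequency(lcc, contexts):
    # The frequency of an lcc: the number of lexical contexts consistent with it.
    return sum(consistent(lc, lcc) for lc in contexts)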
4 Determination Schemes of Segmentation Positions

Segmentation positions are determined in two steps: identifying segmentable positions and selecting the most appropriate position among them. Segmentable positions are identified using the consistency operation, and the maximum entropy model of Section 2 assigns a probability to each position.

Segmentation performance is measured in terms of coverage and accuracy. Coverage is the ratio of the number of actually segmented sentences to the number of segmentation target sentences, i.e., sentences longer than α words, where α is a fixed constant distinguishing long sentences from short ones. Accuracy is evaluated in terms of the safe segmentation ratio. They are defined as follows:

    \text{coverage} = \frac{\# \text{ of actually segmented sentences}}{\# \text{ of sentences to be segmented}} \quad (3)

    \text{accuracy} = \frac{\# \text{ of sentences with safe segmentation}}{\# \text{ of actually segmented sentences}} \quad (4)

4.1 Baseline Scheme

No contextual information is used in identifying segmentable positions; they are identified empirically. A word that is tagged as a segmentation position at least 5 times is identified as a segmentable position. The set of segmentable positions D is

    D = \{ w_i \mid w_i \text{ is tagged as a segmentation position and } \#(\text{tagged } w_i) \ge 5 \}.

In order to select the most appropriate position, the segmentation appropriateness of each position is evaluated by the probability of the word w_i:

    p(w_i) = \frac{\# \text{ of tagged } w_i}{\# \text{ of } w_i \text{ in the corpus}}.

p(w_i) represents the tendency of word w_i to be used as a segmentation position. The segmentation position w_* is selected as the one with the highest p(w_i) value:

    w_* = \arg\max_{w_i \in D} p(w_i).

This scheme serves as a baseline for comparing the segmentation performance of the models.

4.2 A Scheme using Lexical Contextual Constraints

Lexical contextual constraints are used in identifying segmentable positions. Compared with the baseline scheme, this scheme considers the contextual information of a word. All words consistent with the defined lexical contextual constraints form the set of segmentable positions D.

The maximum likelihood principle gives a probability distribution p(y | lcc_{w_i}), where y ∈ {0, 1}. Segmentation appropriateness is evaluated by p(1 | lcc_{w_i}), and the position with the highest p(1 | lcc_{w_i}) becomes the segmentation position:

    w_* = \arg\max_{w_i \in D} p(1 \mid lcc_{w_i}).
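A sketch of the selection step of Section 4.2 is given below, reusing the consistent() helper from the earlier sketch; the probability table p_segment, mapping an lcc to p(1 | lcc), is assumed to come from the trained maximum entropy model.

def select_position(sentence_contexts, constraints_by_word, p_segment):
    """Select the segmentation position with the highest p(1 | lcc).

    sentence_contexts   : list of (word, lexical_context) pairs for one sentence
    constraints_by_word : dict mapping a word w to its constraint set LCC_w
    p_segment           : dict mapping an lcc to p(1 | lcc) from the trained model
    """
    best_word, best_prob = None, -1.0
    for word, lc in sentence_contexts:
        for lcc in constraints_by_word.get(word, ()):
            if consistent(lc, lcc):          # the word is a segmentable position
                prob = p_segment.get(lcc, 0.0)
                if prob > best_prob:
                    best_word, best_prob = word, prob
    return best_word, best_prob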
4.3 A Scheme using Lexical Contextual Constraints with Word Sets

Due to insufficient training samples for constructing lexical contextual constraints, some segmentable positions may not be identified. To alleviate this problem, we introduce word sets whose elements have linguistically similar features.

We define four word sets: a coordinate conjunction set, a subordinate conjunction set, an interrogative set, and an auxiliary verb set. The categories of the word sets and examples of their members are shown in Table 1; currently, the interrogative set has 5 members and the auxiliary verb set has 12 members. The words belonging to each word set are treated equally. Lexical contextual constraints are constructed for both words and word sets, so statistics are collected for both of them. The set of segmentable positions D is defined somewhat differently: it contains the positions whose lexical context is consistent with a constraint for the word itself or for the word set ws_j to which the j-th word in the sentence belongs.

In this scheme, p(1 | lcc_{w_i}) or p(1 | lcc_{ws_i}) expresses the segmentation appropriateness of the position. Therefore, the segmentation position is determined by

    w_* = \arg\max_{\{w_i, ws_i\} \in D} \{ p(1 \mid lcc_{w_i}), p(1 \mid lcc_{ws_i}) \}.
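Under the word-set scheme, the same selection loop would also consult the constraints of a word's word set; the membership table below is a hypothetical illustration (the actual members are listed in Table 1), and consistent() and p_segment are as in the earlier sketches.

WORD_SETS = {
    # hypothetical membership table; the real members are listed in Table 1
    "and": "coordinate_conjunction",
    "because": "subordinate_conjunction",
    "should": "auxiliary_verb",
}

def select_position_with_word_sets(sentence_contexts, constraints_by_key, p_segment):
    # Each word w_i is matched against LCC_{w_i} and, if it belongs to a word
    # set ws_i, against LCC_{ws_i}; the higher p(1 | lcc) is its appropriateness.
    best_word, best_prob = None, -1.0
    for word, lc in sentence_contexts:
        keys = [word] + ([WORD_SETS[word]] if word in WORD_SETS else [])
        for key in keys:
            for lcc in constraints_by_key.get(key, ()):
                if consistent(lc, lcc):
                    prob = p_segment.get(lcc, 0.0)
                    if prob > best_prob:
                        best_word, best_prob = word, prob
    return best_word, best_prob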