<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1061"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics Segment-based Hidden Markov Models for Information Extraction</Title> <Section position="4" start_page="481" end_page="484" type="metho"> <SectionTitle> 2 Document-based HMM IE with the SGT smoothing </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="481" end_page="482" type="sub_section"> <SectionTitle> 2.1 HMM structure </SectionTitle> <Paragraph position="0"> We use an HMM structure (named HMM Context) similar to that in (Freitag and McCallum, 1999) for our document HMM IE system. An example of such an HMM is shown in Figure 1, in which the number of pre-context states, the number of post-context states, and the number of parallel filler paths are all set to 4, the default model parameter setting in our system.</Paragraph> <Paragraph position="2"> HMM Context consists of the following four kinds of states in addition to the special start and end states.</Paragraph> <Paragraph position="3"> Filler states Filler_mn, m = 1, 2, 3, 4 and n = 1, ..., m, correspond to the occurrences of filler tokens.</Paragraph> <Paragraph position="4"> Background state This state corresponds to the occurrences of tokens that are not related to fillers or their contexts.</Paragraph> <Paragraph position="5"> Pre-context states Pre4, Pre3, Pre2, Pre1 correspond to context tokens occurring before the fillers, at the respective positions relative to the fillers.</Paragraph> <Paragraph position="6"> Post-context states Post1, Post2, Post3, Post4 correspond to context tokens occurring after the fillers, at the respective positions relative to the fillers.</Paragraph> <Paragraph position="7"> Our HMM structure differs from the one used in (Freitag and McCallum, 1999) in that we have added transitions from the last post-context state to every pre-context state as well as
every first filler state. This handles the situation where two filler occurrences in the document are so close to each other that the text segment between them is shorter than the sum of the pre-context and post-context sizes.</Paragraph> </Section> <Section position="2" start_page="482" end_page="482" type="sub_section"> <SectionTitle> 2.2 Smoothing in HMM IE </SectionTitle> <Paragraph position="0"> Many probabilities need to be estimated to train an HMM for information extraction from a limited number of labelled documents.</Paragraph> <Paragraph position="1"> The data sparseness problem commonly occurring in probabilistic learning is also an issue in training an HMM IE system, especially when more advanced HMM Context models are used. Since the emission vocabulary is usually large with respect to the number of training examples, maximum likelihood estimation of emission probabilities will lead to inappropriate zero probabilities for many words in the alphabet.</Paragraph> <Paragraph position="2"> The Simple Good-Turing (SGT) smoothing (Gale and Sampson, 1995) is a simple version of the Good-Turing approach, a population frequency estimator used to adjust the observed term frequencies so as to estimate the real population term frequencies. The observed frequency distribution from the sample can be represented as a vector of (r, n_r) pairs, r = 1, 2, .... The r values are the observed term frequencies from the training data, and n_r refers to the number of different terms that occur with frequency r in the sample.</Paragraph> <Paragraph position="3"> For each r observed in the sample, the Good-Turing method estimates its real population frequency as r* = (r + 1) E(n_{r+1}) / E(n_r), where E(n_r) is the expected number of terms with frequency r. For unseen events, an amount of probability P0 is assigned to all these unseen events, P0 = E(n_1)/N ≈ n_1/N, where N is the total number of term occurrences in the sample.</Paragraph> <Paragraph position="4"> The SGT smoothing has been successfully applied to naive Bayes IE systems in (Gu and Cercone, 2006) for more robust probability estimation. We apply the SGT smoothing method to our HMM IE systems to alleviate the data sparseness problem in HMM training. In particular, the emission probability distribution for each state is smoothed using the SGT method. For each state, the number of unseen emission terms is estimated as the difference in observed alphabet size between that state's emission term distribution and the distribution over all terms, before the total unseen probability obtained from the SGT smoothing is distributed among these unseen terms.</Paragraph> <Paragraph position="5"> The data sparseness problem in probability estimation for HMMs has been addressed to some extent in previous HMM-based IE systems (e.g., (Leek, 1997) and (Freitag and McCallum, 1999)).</Paragraph> <Paragraph position="6"> Smoothing methods such as absolute discounting have been used for this purpose. Moreover, (Freitag and McCallum, 1999) uses a shrinkage technique for estimating word emission probabilities of HMMs in the face of sparse training data. It first defines a shrinkage topology over HMM states, then learns the mixture weights for producing interpolated emission probabilities by using a separate data set that is &quot;held-out&quot; from the labelled data. This technique is called deleted interpolation in speech recognition (Jelinek and Mercer, 1980).</Paragraph> </Section> <Section position="3" start_page="482" end_page="483" type="sub_section"> <SectionTitle> 2.3 Experimental results on document HMM </SectionTitle> <Paragraph position="0"> IE and comparison to related work We evaluated our document HMM IE system on the seminar announcements IE domain using ten-fold cross-validation.
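The Good-Turing adjustment described in Section 2.2 can be sketched in a few lines of Python. This is an illustrative sketch only: it uses the raw frequency-of-frequencies counts n_r in place of the smoothed expectations E(n_r) that the full SGT method estimates, and the function name is ours, not from the original system.

```python
from collections import Counter

def good_turing_estimates(sample):
    """Compute adjusted frequencies r* and unseen mass P0 from a sample.

    Sketch of the formulas r* = (r + 1) E(n_{r+1}) / E(n_r) and
    P0 = n_1 / N, approximating each E(n_r) by the observed n_r
    (the full SGT method smooths n_r first).
    """
    freqs = Counter(sample)          # term -> observed frequency r
    n = Counter(freqs.values())      # r -> n_r (frequency of frequencies)
    N = sum(freqs.values())          # total term occurrences in the sample
    r_star = {r: (r + 1) * n[r + 1] / n[r] for r in n if n.get(r + 1)}
    p0 = n[1] / N if 1 in n else 0.0  # probability mass for unseen terms
    return r_star, p0
```

In an HMM IE system the per-state emission distributions would be smoothed this way, with the unseen mass p0 split among the estimated unseen emission terms.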
The data set consists of 485 annotated seminar announcements, with the fillers for the following four slots specified for each seminar: location (the location of a seminar), speaker (the speaker of a seminar), stime (the starting time of a seminar) and etime (the ending time of a seminar). In our HMM IE experiments, the structure parameters are set to the system default values, i.e., 4 for both the pre-context and post-context size, and 4 for the number of parallel filler paths.</Paragraph> <Paragraph position="1"> Table 1 shows the F1 scores (with 95% confidence intervals) of our document HMM IE system (Doc HMM). The performance numbers of other HMM IE systems (Freitag and McCallum, 1999) are also listed in Table 1 for comparison, where HMM None is their HMM IE system that uses absolute discounting but no shrinkage, and HMM Global is the representative version of their HMM IE system with shrinkage.</Paragraph> <Paragraph position="2"> Using the same structure parameters (i.e., the same context size) as in (Freitag and McCallum, 1999), our Doc HMM system performs consistently better on all slots than their HMM IE system using absolute discounting. Even compared to their much more complex version of HMM IE with shrinkage, our system achieves comparable results on location, speaker and stime, and significantly better performance on the etime slot. Note that our smoothing method is much simpler to apply, and does not require any extra effort such as specifying a shrinkage topology or holding out extra labelled data.</Paragraph> </Section> <Section position="4" start_page="483" end_page="483" type="sub_section"> <SectionTitle> 3.1 Issue with document-based HMM IE </SectionTitle> <Paragraph position="0"> In existing HMM-based IE systems, an HMM is used to model the entire document as one long observation sequence emitted from the HMM.
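The HMM Context topology of Section 2.1, including the extra transitions added from the last post-context state, can be enumerated as follows. This is a sketch: the state names and the builder function are illustrative, not taken from the original system.

```python
def build_hmm_context(context_size=4, max_filler_len=4):
    """Enumerate states and allowed transitions of an HMM_Context-style
    topology: a background state, pre/post context chains, and parallel
    filler paths of lengths 1..max_filler_len (illustrative sketch)."""
    states = ["bg"]
    states += [f"pre{i}" for i in range(context_size, 0, -1)]
    states += [f"post{i}" for i in range(1, context_size + 1)]
    for m in range(1, max_filler_len + 1):       # parallel filler paths
        states += [f"filler{m}_{n}" for n in range(1, m + 1)]

    trans = set()
    trans.add(("bg", "bg"))
    trans.add(("bg", f"pre{context_size}"))      # enter the pre-context window
    for i in range(context_size, 1, -1):
        trans.add((f"pre{i}", f"pre{i-1}"))
    for m in range(1, max_filler_len + 1):
        trans.add(("pre1", f"filler{m}_1"))      # pick a filler path by length
        for n in range(1, m):
            trans.add((f"filler{m}_{n}", f"filler{m}_{n+1}"))
        trans.add((f"filler{m}_{m}", "post1"))
    for i in range(1, context_size):
        trans.add((f"post{i}", f"post{i+1}"))
    trans.add((f"post{context_size}", "bg"))
    # the transitions this paper adds: last post-context state to every
    # pre-context state and to every first filler state (close fillers)
    for i in range(1, context_size + 1):
        trans.add((f"post{context_size}", f"pre{i}"))
    for m in range(1, max_filler_len + 1):
        trans.add((f"post{context_size}", f"filler{m}_1"))
    return states, trans
```

With the default parameters this yields 19 states (1 background, 4 pre-context, 4 post-context, and 1+2+3+4 filler states).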
The extracted fillers are identified as any part of the sequence whose tokens are labelled with one of the filler states. The commonly used HMM structure in IE allows multiple passes through the paths of the filler states, so the labelled state sequence may present multiple filler extractions.</Paragraph> <Paragraph position="1"> It is not clear from the performance reports of previous works (e.g., (Freitag and McCallum, 1999)) how exactly a correct extraction for one document is defined in HMM IE evaluation. One way to define a correct extraction for a document is to require that at least one of the text segments that pass the filler states is the same as a labelled filler. Alternatively, we can define correctness by requiring that all the text segments that pass the filler states are the same as the labelled fillers. In the latter case, an exact match is actually required between the HMM state sequence determined by the system and the originally labelled one for that document. Very likely, the former correctness criterion was used in evaluating these document-based HMM IE systems. We used the same criterion for evaluating our document HMM IE systems in Section 2.</Paragraph> <Paragraph position="2"> Although it might be reasonable to consider a document correctly extracted if any one of the fillers identified from the state sequence labelled by the system is a correct filler, certain issues arise when a document HMM IE system returns multiple extractions for the same slot for one document. For example, it is possible that some of the fillers found by the system are not correct extractions.
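The two candidate correctness criteria can be stated concretely as follows. This is a sketch; comparing fillers as exact strings against a set of labelled fillers is an assumption, as are the function names.

```python
def correct_any(extracted, labelled):
    """Loose criterion: the document counts as correctly extracted if at
    least one extracted filler matches a labelled filler."""
    return any(e in labelled for e in extracted)

def correct_exact(extracted, labelled):
    """Strict criterion: every extracted filler must match a labelled
    filler, i.e., the labelled state sequence is reproduced exactly."""
    return len(extracted) > 0 and all(e in labelled for e in extracted)
```

Under the loose criterion a document with one correct and one spurious filler still counts as correct, which is exactly the situation the redundancy measure below is meant to expose.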
In this situation, such document-wise extraction evaluation alone would not be sufficient to measure the performance of an HMM IE system.</Paragraph> <Paragraph position="3"> Document HMM IE modelling does not provide any guidelines for selecting one most likely filler from the ones identified by the state sequence matching over the whole document. For the template filling IE problem that is of interest in this paper, the ideal extraction result is one slot filler per document. Otherwise, some further post-processing would be required to choose only one extraction, from the multiple fillers possibly extracted by a document HMM IE system, for filling in the slot template for that document.</Paragraph> </Section> <Section position="5" start_page="483" end_page="484" type="sub_section"> <SectionTitle> 3.2 Concept of document extraction </SectionTitle> <Paragraph position="0"> redundancy in HMM IE In order to make a more complete extraction performance evaluation in an HMM-based IE system, we introduce another performance measure, document extraction redundancy, defined over the documents that contain at least one correct extraction as the ratio of incorrect extractions to all the extractions issued for that document. If all the issued extractions are correct ones, then the extraction redundancy for that document is 0. Among all the issued extractions, the larger the number of incorrect extractions is, the closer the extraction redundancy for that document is to 1. However, the extraction redundancy can never be 1 according to our definition, since this measure is only defined over the documents that contain at least one correct extraction.</Paragraph> <Paragraph position="1"> Now let us have a look at the extraction redundancy in the document HMM IE system from Section 2. We calculate the average document extraction redundancy over all the documents that are judged as correctly extracted.
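The redundancy measure can be sketched as follows. This is an assumption-laden sketch: the exact wording of Definition 1 is garbled in this text, so the ratio of incorrect extractions to all issued extractions is inferred from the surrounding discussion and the parallel Definition 4.

```python
def doc_redundancy(extractions, labelled):
    """Extraction redundancy for one correctly extracted document:
    the ratio of incorrect extractions to all issued extractions
    (inferred definition). It is 0 when every issued extraction is
    correct, and always < 1 because the measure is only defined over
    documents with at least one correct extraction."""
    assert any(e in labelled for e in extractions)
    wrong = sum(1 for e in extractions if e not in labelled)
    return wrong / len(extractions)

def average_redundancy(docs):
    """Average redundancy over documents judged as correctly extracted.
    `docs` is a list of (extractions, labelled_fillers) pairs; documents
    without a correct extraction are excluded from the average."""
    correct = [(ex, lab) for ex, lab in docs
               if any(e in lab for e in ex)]
    return sum(doc_redundancy(ex, lab) for ex, lab in correct) / len(correct)
```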
The evaluation results for the document extraction redundancy (shown in column R) are listed in Table 2, paired with their corresponding F1 scores. Generally speaking, the HMM IE systems based on document modelling have exhibited a certain extraction redundancy for every slot in this IE domain, and in some cases, such as speaker and stime, the average extraction redundancy is far from negligible.</Paragraph> </Section> </Section> <Section position="5" start_page="484" end_page="485" type="metho"> <SectionTitle> 4 Segment-based HMM IE Modelling </SectionTitle> <Paragraph position="0"> In order to make the IE system capable of producing the ideal extraction result that issues only one slot filler for each document, we propose a segment-based HMM IE framework in the following sections of this paper. We expect this framework to dramatically reduce the document extraction redundancy and to let the resulting IE system output extraction results for the template filling IE task with minimal post-processing.</Paragraph> <Paragraph position="1"> The basic idea of our approach is to use HMMs to extract fillers from only the extraction-relevant parts of the text instead of the entire document. We refer to this modelling as segment-based HMM IE, or segment HMM IE for brevity. The unit of the extraction-relevant text segments is definable according to the nature of the texts. For most texts, one sentence in the text can be regarded as a text segment.
For some texts that are not written in a grammatical style, where sentence boundaries are hard to identify, we can define an extraction-relevant text segment to be the part of text that includes a filler occurrence and its contexts.</Paragraph> <Section position="1" start_page="484" end_page="484" type="sub_section"> <SectionTitle> 4.1 Segment-based HMM IE modelling: the </SectionTitle> <Paragraph position="0"> procedure By imposing an extraction-relevant text segment retrieval step in the segment HMM IE modelling, we perform an extraction on a document by completing the following two successive sub-tasks. Step 1: Identify from the entire document the text segments that are relevant to a specific slot extraction. In other words, the document is filtered by locating text segments that might contain a filler.</Paragraph> <Paragraph position="1"> Step 2: Extraction is performed by applying the segment HMM only to the extraction-relevant text segments obtained in the first step. Each retrieved segment is labelled with its most probable state sequence by the HMM, and all these segments are sorted according to the normalized likelihoods of their best state sequences. The filler(s) identified by the segment with the largest likelihood is/are returned as the extraction result.</Paragraph> </Section> <Section position="2" start_page="484" end_page="485" type="sub_section"> <SectionTitle> 4.2 Extraction from relevant segments </SectionTitle> <Paragraph position="0"> Since more than one segment is usually retrieved at Step 1, these segments need to compete at Step 2 for issuing extraction(s) from their best state sequences found with regard to the HMM l used for extraction. For each segment s with token length n, its normalized best state sequence likelihood is defined as follows.</Paragraph> <Paragraph position="2"> l(s) = max_Q log P(s, Q|l) / n, where l is the HMM and Q is any possible state sequence associated with s.
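The competition among retrieved segments at Step 2 can be sketched with a small Viterbi routine. This is illustrative only: the dictionary-based HMM representation, the floor value for unseen emissions/transitions, and the per-token log-likelihood normalization are all assumptions, not details from the original system.

```python
import math

def best_path_loglik(tokens, pi, A, B):
    """Viterbi sketch: max over state sequences Q of log P(tokens, Q | HMM).

    pi: state -> initial probability; A: state -> {state: prob};
    B: state -> {token: prob}. Unseen entries get a tiny floor (assumption).
    """
    states = list(pi)
    prev = {q: math.log(pi[q]) + math.log(B[q].get(tokens[0], 1e-12))
            for q in states}
    for tok in tokens[1:]:
        prev = {q: max(prev[p] + math.log(A[p].get(q, 1e-12))
                       for p in states) + math.log(B[q].get(tok, 1e-12))
                for q in states}
    return max(prev.values())

def rank_segments(segments, pi, A, B):
    """Rank segments by the length-normalized best-path log-likelihood
    l(s) = max_Q log P(s, Q | HMM) / len(s); the top-ranked segment
    supplies the extraction (sketch of Step 2)."""
    scored = [(best_path_loglik(s, pi, A, B) / len(s), s) for s in segments]
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored
```

Normalizing by segment length keeps longer segments from being penalized merely for emitting more tokens, so segments of different lengths can compete on a per-token basis.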
All the retrieved segments are then ranked according to their l(s); the segment with the highest l(s) is selected, and the extraction is identified from its state sequence labelled by the segment HMM.</Paragraph> <Paragraph position="3"> This proposed two-step HMM-based extraction procedure requires that the training of the IE models follow the same style. First, we need to learn an extraction-relevance segment retrieval system from the labelled texts, which is described in detail in Section 5. Then, an HMM is trained for each slot extraction by using only the extraction-relevant text segments instead of the whole documents. By limiting the HMM training to a much smaller part of the texts, basically including the fillers and their surrounding contexts, the alphabet size of all emission symbols associated with the HMM is significantly reduced. Compared to the common document-based HMM IE modelling, our proposed segment-based HMM IE modelling also eases the HMM training difficulty caused by the data sparseness problem, since we are working on a smaller alphabet.</Paragraph> </Section> </Section> <Section position="6" start_page="485" end_page="486" type="metho"> <SectionTitle> 5 Extraction-relevant segment retrieval </SectionTitle> <Paragraph position="0"> using HMMs We propose a segment retrieval approach for performing the first subtask by also using HMMs. In particular, it trains an HMM from labelled segments in texts, and then uses the learned HMM to determine whether a segment is relevant or not with regard to a specific extraction task.
To distinguish the HMM used for segment retrieval in the first step from the HMM used for extraction in the second step, we call the former the retrieval HMM and the latter the extractor HMM.</Paragraph> <Section position="1" start_page="485" end_page="485" type="sub_section"> <SectionTitle> 5.1 Training HMMs for segment retrieval </SectionTitle> <Paragraph position="0"> Training a retrieval HMM requires each training segment to be labelled in the same way as the annotated training documents. After the training texts are segmented into sentences (we use the sentence as the segment unit), the obtained segments that carry the original slot filler tags are used directly as the training examples for the retrieval HMM.</Paragraph> <Paragraph position="1"> An HMM with the same IE-specific structure is trained from the prepared training segments in exactly the same way as we train an HMM in the document HMM IE system from a set of training documents. The difference is that much shorter labelled observation sequences are used.</Paragraph> </Section> <Section position="2" start_page="485" end_page="485" type="sub_section"> <SectionTitle> 5.2 Segment retrieval using HMMs </SectionTitle> <Paragraph position="0"> After a retrieval HMM is trained from the labelled segments, we use this HMM to determine whether an unseen segment is relevant to a specific extraction task. This is done by estimating, from the HMM, how likely it is that the state sequence associated with the given segment passes through the target filler states. The HMM l trained from labelled segments has the structure shown in Figure 1.
So for a segment s, all the possible state sequences can be categorized into two kinds: the state sequences passing through one of the target filler paths, and the state sequences not passing through any target filler states.</Paragraph> <Paragraph position="1"> Because of the structure constraints of the specified HMM in IE, the second kind of state sequence actually has only one possible path, denoted as Qbg, in which the whole observation sequence of s starts at the background state qbg and stays in the background state until the end. Let s = O1 O2 ... OT, where T is the length of s in tokens. The probability of s following this particular background state path Qbg can be easily calculated with respect to the HMM l as P(s, Qbg|l) = pi_qbg b_qbg(O1) prod_{t=2..T} a_qbg,qbg b_qbg(Ot),</Paragraph> <Paragraph position="3"> where pi_i is the initial state probability for state i, b_i(Ot) is the emission probability of symbol Ot at state i, and a_ij is the state transition probability from state i to state j.</Paragraph> <Paragraph position="4"> We know that the probability of observing s given the HMM l actually sums over the probabilities of observing s on all the possible state sequences given the HMM, i.e., P(s|l) = sum_{all Q} P(s, Q|l).</Paragraph> <Paragraph position="6"> Let Qfiller denote the set of state sequences that pass through any filler states. We have {all Q} = {Qbg} ∪ Qfiller.
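The background-path probability just described (initial probability, then repeated background self-transitions and emissions) can be computed directly in log space. A sketch; the dictionary HMM representation, the state name "bg", and the emission floor are illustrative assumptions.

```python
import math

def background_path_loglik(tokens, pi, A, B, bg="bg"):
    """log P(s, Q_bg | HMM): the segment stays in the background state
    for its whole length, the only path avoiding all filler states.
    Implements pi_bg * b_bg(O1) * prod_{t=2..T} a_bg,bg * b_bg(Ot)."""
    lp = math.log(pi[bg]) + math.log(B[bg].get(tokens[0], 1e-12))
    for tok in tokens[1:]:
        lp += math.log(A[bg][bg]) + math.log(B[bg].get(tok, 1e-12))
    return lp
```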
P(s|l) can be calculated efficiently using the forward-backward procedure, which makes the estimate of the total probability of all state paths that go through filler states straightforward: P(s, Qfiller|l) = P(s|l) - P(s, Qbg|l).</Paragraph> <Paragraph position="8"> If P(s, Qfiller|l) &gt; P(s, Qbg|l), then s is considered more likely to have filler occurrence(s).</Paragraph> <Paragraph position="9"> Therefore, in this case we classify s as an extraction-relevant segment and it will be retrieved.</Paragraph> </Section> <Section position="3" start_page="485" end_page="486" type="sub_section"> <SectionTitle> 5.3 Document-wise retrieval performance </SectionTitle> <Paragraph position="0"> Since the purpose of our segment retrieval is to identify relevant segments from each document, we need to define how to determine whether a document is correctly filtered (i.e., with extraction-relevant segments retrieved) by a given segment retrieval system. We consider two criteria, first a loose correctness definition as follows: Definition 2. A document is least correctly filtered by the segment retrieval system when at least one of the extraction-relevant segments in that document has been retrieved by the system; otherwise, we say the system fails on that document.</Paragraph> <Paragraph position="1"> Then we define a stricter correctness measure as follows: Definition 3. A document is most correctly filtered by the segment retrieval system only when all the extraction-relevant segments in that document have been retrieved by the system; otherwise, we say the system fails on that document.</Paragraph> <Paragraph position="2"> The overall segment retrieval performance is measured by retrieval precision (i.e., the ratio of the number of correctly filtered documents to the number of documents from which the system has retrieved at least one segment) and retrieval recall (i.e., the ratio of the number of correctly filtered documents to the number of documents that contain relevant segments).
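The document-wise retrieval precision and recall under the two correctness criteria can be computed as follows. A sketch: representing each document as a pair of retrieved/relevant segment-id sets is an assumption for illustration.

```python
def retrieval_metrics(docs, strict=False):
    """Document-wise retrieval precision and recall.

    docs: list of (retrieved, relevant) segment-id sets, one pair per
    document. strict=False uses the "least correctly filtered" criterion
    (at least one relevant segment retrieved); strict=True uses the
    "most correctly filtered" criterion (all relevant segments retrieved).
    """
    def ok(retrieved, relevant):
        if strict:
            return bool(relevant) and relevant <= retrieved
        return bool(retrieved & relevant)

    correct = sum(1 for r, rel in docs if ok(r, rel))
    with_retrieval = sum(1 for r, _ in docs if r)   # precision denominator
    with_relevant = sum(1 for _, rel in docs if rel)  # recall denominator
    precision = correct / with_retrieval if with_retrieval else 0.0
    recall = correct / with_relevant if with_relevant else 0.0
    return precision, recall
```

A document with only part of its relevant segments retrieved counts as correct under the loose criterion but as a failure under the strict one, which is exactly the gap between the two measures.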
According to the two correctness measures just defined, the overall retrieval performance for all testing documents can be evaluated under both the least correctly filtered and the most correctly filtered measures.</Paragraph> <Paragraph position="3"> We also evaluate the average document-wise segment retrieval redundancy, as defined in Definition 4, to measure the segment retrieval accuracy. Definition 4. Document-wise segment retrieval redundancy is defined over the documents which are least correctly filtered by the segment retrieval system, as the ratio of the retrieved irrelevant segments to all retrieved segments for that document. 5.4 Experimental results on segment retrieval Table 3 shows the document-wise segment retrieval performance evaluation results under both the least correctly filtered and most correctly filtered measures, as well as the average number of retrieved segments for each document (in column nSeg) and the average retrieval redundancy.</Paragraph> <Paragraph position="4"> As shown in Table 3, the segment retrieval results achieve high recall, especially under the least correctly filtered correctness criterion. In addition, the system produces retrieval results with relatively small redundancy, which means that most of the segments fed to the segment HMM extractor from the retrieval step are actually extraction-relevant segments.</Paragraph> <Paragraph position="5"> 6 Segment vs. document HMM IE We conducted experiments to evaluate our segment-based HMM IE model, using the proposed segment retrieval approach, and compared its final extraction performance to that of the document-based HMM IE model. Table 4 shows the overall performance comparison between the document HMM IE system (Doc HMM) and the segment HMM IE system (Seg HMM).</Paragraph> <Paragraph position="6"> Compared to the document-based HMM IE modelling, the extraction performance on location is significantly improved by our segment HMM IE system.
An important improvement of the segment HMM IE system is that it has achieved zero extraction redundancy for all the slots in this experiment.</Paragraph> </Section> </Section> </Paper>