<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2202"> <Title>Simple Information Extraction (SIE): A Portable and Effective IE System</Title> <Section position="4" start_page="9" end_page="10" type="metho"> <SectionTitle> 3 Instance Filtering </SectionTitle> <Paragraph position="0"> The purpose of the Instance Filtering (IF) module is to reduce the data set size and skewness by discarding harmful and superfluous instances without degrading the prediction accuracy. This is a generic module that can be exploited by any supervised system that casts IE as a classification problem.</Paragraph> <Paragraph position="1"> Instance Filtering (Gliozzo et al., 2005a) is based on the assumption that uninformative words are not likely to belong to entities to recognize, being their information content very low. A naive implementation of this assumption consists in filtering out very frequent words in corpora because they are less likely to be relevant than rare words. However, in IE relevant entities can be composed bymorethanonetokenandinsomedomainsafew of such tokens can be very frequent in the corpus. For example, in the field of bioinformatics, protein namesoftencontainparentheses, whosefrequency in the corpus is very high.</Paragraph> <Paragraph position="2"> Todealwiththisproblem, weexploitasetofInstance Filters (called Stop Word Filters), included in a Java tool called jInFil1. These filters perform a &quot;shallow&quot; supervision to identify frequent words that are often marked as positive examples.</Paragraph> <Paragraph position="3"> The resulting filtering algorithm consists of two stages. First, the set of uninformative tokens is identified by training the term filtering algorithm on the training corpus. Second, instances describing&quot;uninformative&quot;tokensareremovedfromboth null the training and the test sets. Note that instances are not really removed from the data set, but just marked as uninformative. In this way the learning algorithm will not learn from these instances, but they will still appear in the feature description of the remaining instances.</Paragraph> <Paragraph position="4"> A Stop Word Filter is fully specified by a list of stop words. To identify such a list, different feature selection methods taken from the text categorization literature can be exploited. In text categorization, feature selection is used to remove non-informative terms from representations of texts. In this sense, IF is closely related to feature selection: in the former non-informative words are removed from the instance set, while in the latter they are removed from the feature set. Below, we describe the different metrics used to collect a stop word list from the training corpora.</Paragraph> <Paragraph position="5"> InformationContent(IC) Themostcommonly used feature selection metric in text categorization is based on document frequency (i.e, the number of documents in which a term occurs). The basic assumption is that very frequent terms are non-informative for document indexing. The frequency of a term in the corpus is a good indicator of its generality, rather than of its information content. From this point of view, IF consists of removing all tokens with a very low information content2.</Paragraph> <Paragraph position="6"> Correlation Coefficient (CC) In text categorization thekh2 statistic is used to measure the lack of independence between a term and a category (Yang and Pedersen, 1997). 
<Paragraph position="6"> Correlation Coefficient (CC) In text categorization, the χ2 statistic is used to measure the lack of independence between a term and a category (Yang and Pedersen, 1997). The correlation coefficient CC, where CC2 = χ2, of a term with the negative class can be used to find those terms that are less likely to express relevant information in texts.</Paragraph>
<Paragraph position="7"> Odds Ratio (OR) The odds ratio measures the ratio between the odds of a term occurring in the positive class and the odds of it occurring in the negative class. In text categorization the idea is that the distribution of the features in the relevant documents differs from their distribution in the non-relevant documents (Raskutti and Kowalczyk, 2004). Following this assumption, a term is non-informative when its probability of being a negative example is sensibly higher than its probability of being a positive example (Gliozzo et al., 2005b).</Paragraph>
<Paragraph position="8"> An Instance Filter is evaluated using two metrics: the Filtering Rate (ψ), the total percentage of filtered tokens in the data set, and the Positive Filtering Rate (ψ+), the percentage of positive tokens (wrongly) removed. A filter is optimized by maximizing ψ and minimizing ψ+; this allows us to reduce the data set size as much as possible while preserving most of the positive instances.</Paragraph>
<Paragraph position="9"> We fixed the accepted level of tolerance (ε) on ψ+ and found the maximum ψ by performing 5-fold cross-validation on the training set.</Paragraph>
</Section>
<Section position="5" start_page="10" end_page="11" type="metho">
<SectionTitle> 4 Feature Extraction </SectionTitle>
<Paragraph position="0"> The Feature Extraction module is used to extract a pre-defined set of features for each input token.</Paragraph>
<Paragraph position="1"> As stated above, we consider each token an instance to be classified as a specific entity boundary or not. To perform Feature Extraction, an application called jFex was implemented. jFex generates the features specified by a feature extraction script, indexes them, and returns the example set, as well as the mapping between the features and their indices (the lexicon). If specified, it only extracts features for the instances not marked as "uninformative" by Instance Filtering. jFex is strongly inspired by FEX (Cumby and Yih, 2003), but it introduces several improvements. First of all, it provides an enriched feature extraction language. Secondly, it makes it possible to further extend this language through a Java API, providing a flexible tool for defining task-specific features. Finally, jFex can output the example set in formats directly usable by LIBSVM (Chang and Lin, 2001), SVM-light (Joachims, 1998) and SNoW (Carlson et al., 1999).</Paragraph>
<Section position="1" start_page="10" end_page="11" type="sub_section">
<SectionTitle> 4.1 Corpus Format </SectionTitle>
<Paragraph position="0"> The corpus must be prepared in the IOBE notation, an extension of the IOB notation. Neither notation allows nested or overlapping entities. Tokens outside entities are tagged with O, while the first token of an entity is tagged with B-entity-type, the last token is tagged with E-entity-type, and all the tokens inside the entity boundaries are tagged with I-entity-type, where entity-type is the type of the marked entity (e.g. protein, person).</Paragraph>
<Paragraph position="1"> Besides the tokens and their types, the notation makes it possible to represent general purpose and task-specific annotations by defining new columns. Blank lines can be used to mark sentence or document boundaries. Table 1 shows an example of a prepared corpus. The columns are, respectively: the entity type, the PoS tag, the actual token, the token index, and the output of the instance filter ("uninformative" tokens are marked with 0).</Paragraph>
</Section>
<Section position="2" start_page="11" end_page="11" type="sub_section">
<SectionTitle> 4.2 Extraction Language </SectionTitle>
<Paragraph position="0"> As input to the begin and end classifiers, we use a bit-vector representation. Each instance is represented by encoding all the following basic features for the actual token and for all the tokens in a context window of fixed size (in the reported experiments, 3 words before and 3 words after the actual token):</Paragraph>
<Paragraph position="1"> Token The actual token.</Paragraph>
<Paragraph position="2"> POS The Part of Speech (PoS) of the token.</Paragraph>
<Paragraph position="3"> Token Shapes This feature maps each token into equivalence classes that encode attributes such as capitalization, numerals, single character, and so on.</Paragraph>
<Paragraph position="4"> Bigrams Bigrams of tokens and PoS tags.</Paragraph>
<Paragraph position="5"> The Feature Extraction language makes it possible to formally encode the above problem description through a script. Table 2 provides the extraction script used in the experiments.</Paragraph>
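To make the Token Shapes feature concrete, the following is a small illustrative sketch of such an equivalence-class mapping (the class names and rules are hypothetical, not jFex's actual ones):

```java
/**
 * Illustrative sketch of a "token shape" mapping: each token is assigned
 * to an equivalence class encoding capitalization, numerals, single
 * characters, and so on. The classes below are examples only; the shapes
 * actually used by jFex may differ.
 */
public class TokenShape {

    public static String shapeOf(String token) {
        if (token.length() == 1)          return "SINGLE_CHAR";
        if (token.matches("\\d+"))        return "ALL_DIGITS";
        if (token.matches("[A-Z]+"))      return "ALL_CAPS";
        if (token.matches("[A-Z][a-z]+")) return "CAPITALIZED";
        if (token.matches(".*\\d.*"))     return "CONTAINS_DIGIT";
        return "LOWERCASE_OR_OTHER";
    }

    public static void main(String[] args) {
        // "Smith" -> CAPITALIZED, "DNA" -> ALL_CAPS, "p53" -> CONTAINS_DIGIT,
        // "(" -> SINGLE_CHAR, "1997" -> ALL_DIGITS
        for (String t : new String[] {"Smith", "DNA", "p53", "(", "1997"}) {
            System.out.println(t + " -> " + shapeOf(t));
        }
    }
}
```

Like the other basic features, the shape would be extracted for the actual token and for each token in the ±3 context window.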
</Section>
</Section>
<Section position="6" start_page="11" end_page="11" type="metho">
<SectionTitle> 5 Learning and Classification Modules </SectionTitle>
<Paragraph position="0"> As already mentioned, we approach IE as a classification problem, assigning an appropriate classification label to each token in the data set, except for the tokens marked as irrelevant by the instance filter. As the learning algorithm we use SVM-light. In particular, we treat the identification of the boundaries that indicate the beginning and the end of each entity as two distinct classification tasks, following the approach adopted in (Ciravegna, 2000; Freitag and Kushmerick, 2000). All tokens that begin (end) an entity are considered positive instances for the begin (end) classifier, while all the remaining tokens are negative instances. In this way, two distinct models are learned, one for the beginning boundary and another for the end boundary. All the predictions produced by the begin and end classifiers are then paired by the Tag Matcher module.</Paragraph>
<Paragraph position="1"> When we have to deal with more than one entity type (i.e., with a multi-class problem), we train 2n binary classifiers (where n is the number of entity types for the task). Again, all the predictions are paired by the Tag Matcher module.</Paragraph>
</Section>
<Section position="7" start_page="11" end_page="12" type="metho">
<SectionTitle> 6 Tag Matcher </SectionTitle>
<Paragraph position="0"> All the positive predictions produced by the begin and end classifiers are paired by the Tag Matcher module. If nested or overlapping entities occur, even if they are of different types, the entity with the highest score is selected. The score of each entity is proportional to the entity length probability (i.e., the probability that an entity has a certain length) and to the scores assigned by the classifiers to the boundary predictions. Normalizing the scores makes it possible to treat the score function as a probability distribution. The entity length distribution is estimated from the training set.</Paragraph>
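This pairing logic can be sketched as follows (hypothetical Java code, not the actual Tag Matcher module): each begin prediction is paired with every end prediction that does not precede it, every candidate entity is scored as the product of the two boundary scores and the length probability, and overlapping candidates are resolved in favour of the highest score:

```java
import java.util.*;

/**
 * Illustrative sketch of the Tag Matcher's scoring scheme (hypothetical
 * code; the actual module may differ). A candidate spanning tokens
 * [begin, end] is scored as beginScore * endScore * P(length). For
 * simplicity this sketch handles a single entity type; in the multi-type
 * setting each type's begin/end predictions would be paired the same way.
 */
public class TagMatcher {

    record Boundary(int pos, double score) {}
    record Candidate(int begin, int end, double score) {}

    public static List<Candidate> match(List<Boundary> begins,
                                        List<Boundary> ends,
                                        Map<Integer, Double> lengthProb) {
        // Pair every begin prediction with every end prediction after it.
        List<Candidate> candidates = new ArrayList<>();
        for (Boundary b : begins) {
            for (Boundary e : ends) {
                if (e.pos() >= b.pos()) {
                    int length = e.pos() - b.pos() + 1;
                    // P(length) is estimated from the training set.
                    double p = lengthProb.getOrDefault(length, 0.0);
                    candidates.add(new Candidate(b.pos(), e.pos(),
                                                 b.score() * e.score() * p));
                }
            }
        }
        // Keep the highest-scoring candidates, discarding any candidate
        // that overlaps an already selected entity.
        candidates.sort(Comparator.comparingDouble(Candidate::score).reversed());
        List<Candidate> selected = new ArrayList<>();
        for (Candidate c : candidates) {
            boolean overlaps = selected.stream()
                .anyMatch(s -> c.begin() <= s.end() && s.begin() <= c.end());
            if (!overlaps) selected.add(c);
        }
        return selected;
    }
}
```

Applied to the worked example below, this scheme reproduces the three candidate scores and selects "Mr. John Smith".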
<Paragraph position="1"> For example, in the corpus fragment of Table 3 the begin and end classifiers have identified four possible entity boundaries for the speaker of a seminar. In the table, the left column shows the actual label, while the right column shows the predictions and their normalized scores. The matching algorithm has to choose among three mutually exclusive candidates: "Mr. John", "Mr. John Smith" and "John Smith", with scores 0.23 × 0.12 × 0.33 = 0.009108, 0.23 × 0.34 × 0.28 = 0.021896 and 0.1 × 0.34 × 0.33 = 0.01122, respectively. The length distribution for the entity speaker is shown in Table 4. In this example the matcher, by choosing the candidate that maximizes the score function (namely the second one), extracts the actual entity.</Paragraph>
</Section>
</Paper>