File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-0503_metho.xml
Size: 6,771 bytes
Last Modified: 2025-10-06 14:10:37
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0503"> <Title>Max-Planck-Institute for Computer Science</Title> <Section position="5" start_page="19" end_page="20" type="metho"> <SectionTitle> 2 System Description </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 2.1 Document Pre-Processing </SectionTitle> <Paragraph position="0"> LEILA accepts HTML documents as input. To allow the system to handle date and number expressions, we normalize these constructions by regular expression matching in combination with a set of functions. For example, the expression &quot;November 23rd to 24th 1998&quot; becomes &quot;1998-11-23 to 1998-11-24&quot; and the expression &quot;0.8107 acre-feet&quot; becomes &quot;1000 cubic-meters&quot;. Then, we split the original HTML-document into two files: The first file contains the proper sentences with the HTML-tags removed. The second file contains the non-grammatical parts, such as lists, expressions using parentheses and other constructions that cannot be handled by the Link Parser. For example, the character sequence &quot;Chopin (born 1810) was a great composer&quot; is split into the sentence &quot;Chopin was a great composer&quot; and the non-grammatical information &quot;Chopin (born 1810)&quot;. The grammatical file is parsed by the Link Parser.</Paragraph> <Paragraph position="1"> The parsing allows for a restricted named entity recognition, because the parser links noun groups like &quot;United States of America&quot; by designated connectors. Furthermore, the parsing allows us to do anaphora resolution. We use a conservative approach, which simply replaces a third person pronoun by the subject of the preceding sentence.</Paragraph> <Paragraph position="2"> For our goal, it is essential to normalize nouns to their singular form. This task is non-trivial, because there are numerous words with irregular plural forms and there exist even word forms that can be either the singular form of one word or the plural form of another. By collecting these exceptions systematically from WordNet, we were able to stem most of them correctly with our Plural-to-Singular Stemmer (PlingStemmer1). For the non-grammatical files, we provide a pseudo-parsing, which links each two adjacent items by an artificial connector. As a result, the uniform output of the preprocessing is a sequence of linkages, which constitutes the input for the core algorithm.</Paragraph> </Section> <Section position="2" start_page="19" end_page="20" type="sub_section"> <SectionTitle> 2.2 Core Algorithm </SectionTitle> <Paragraph position="0"> As a definition of the target relation, our algorithm requires a function (given by a Java method) that decides into which of the following categories a pair of words falls: * The pair can be an example for the target relation. For instance, for the birthdaterelation, the examples can be given by a list of persons with their birth dates.</Paragraph> <Paragraph position="1"> * The pair can be a counterexample. For the birthdate-relation, the counterexamples can be deduced from the examples (e.g. if &quot;Chopin&quot; / &quot;1810&quot; is an example, then &quot;Chopin&quot; / &quot;2000&quot; must be a counterexample).</Paragraph> <Paragraph position="2"> * The pair can be a candidate. For birthdate, the candidates would be all pairs of a proper name and a date that are not an example or a counterexample.</Paragraph> <Paragraph position="3"> * The pair can be none of the above.</Paragraph> <Paragraph position="4"> The core algorithm proceeds in three phases: 1. In the Discovery Phase, it seeks linkages in which an example pair appears. It replaces the two words by placeholders, thus producing a pattern. These patterns are collected as positive patterns. Then, the algorithm runs through the sentences again and finds all linkages that match a positive pattern, but produce a counterexample. The corresponding patterns are collected as negative patterns2.</Paragraph> <Paragraph position="5"> 2. In the Training Phase, statistical learning is applied to learn the concept of positive patterns. The result of this process is a classifier for patterns. null 3. In the Testing Phase, the algorithm considers again all sentences in the corpus. For each linkage, it generates all possible patterns by replacing two words by placeholders. If the two words form a candidate and the pattern is classified as positive, the produced pair is proposed as a new element of the target relation (an output pair). In principle, the core algorithm does not depend on a specific grammar or a specific parser. It can work on any type of grammatical structures, as long as some kind of pattern can be defined on them. It is also possible to run the Discovery Phase and the Testing Phase on different corpora.</Paragraph> </Section> <Section position="3" start_page="20" end_page="20" type="sub_section"> <SectionTitle> 2.3 Learning Model </SectionTitle> <Paragraph position="0"> The central task of the Discovery Phase is determining patterns that express the target relation.</Paragraph> <Paragraph position="1"> These patterns are generalized in the Training Phase. In the Testing Phase, the patterns are used to produce the output pairs. Since the linguistic meaning of the patterns is not apparent to the system, the Discovery Phase relies on the following hypothesis: Whenever an example pair appears in a sentence, the linkage and the corresponding pattern express the target relation. This hypothesis may fail if a sentence contains an example pair merely by chance, i.e. without expressing the target relation. Analogously, a pattern that does express the target relation may occasionally produce counterexamples. We call these patterns false samples. Virtually any learning algorithm can deal with a limited number of false samples.</Paragraph> <Paragraph position="2"> To show that our approach does not depend on a specific learning algorithm, we implemented two classifiers for LEILA: One is an adaptive k-Nearest-Neighbor-classifier (kNN) and the other one uses a Support Vector Machine (SVM). These classifiers, the feature selection and the statistical model are explained in detail in (Suchanek et al., 2006). Here, we just note that the classifiers yield a real valued label for a test pattern. This value can be interpreted as the confidence of the classification. Thus, it is possible to rank the output pairs of LEILA by their confidence.</Paragraph> <Paragraph position="3"> 2Note that different patterns can match the same linkage.</Paragraph> </Section> </Section> class="xml-element"></Paper>