<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0625"> <Title>Detecting Text Similarity Over Short Passages: Exploring Linguistic Feature Combinations Via Machine Learning</Title> <Section position="5" start_page="204" end_page="206" type="metho"> <SectionTitle> 4 Methodology </SectionTitle> <Paragraph position="0"> We compute a feature vector over a pair of textual units, where features are either primitive, consisting of one characteristic, or composite, consisting of pairs of primitive features.</Paragraph> <Section position="1" start_page="204" end_page="204" type="sub_section"> <SectionTitle> 4.1 Primitive Features </SectionTitle> <Paragraph position="0"> Our features draw on a number of linguistic approaches to text analysis, and are based on both single words and simplex noun phrases (head nouns preceded by optional premodifiers but with no embedded recursion). Each of these morphological, syntactic, and semantic features has several variations. We thus consider the following potential matches between text units: * Word co-occurrence, i.e., sharing of a single word between text units. Variations of this feature restrict matching to cases where the parts of speech of the words also match, or relax it to cases where just the stems of the two words are identical.</Paragraph> <Paragraph position="1"> * Matching noun phrases. We use the LINKIT tool \[Wacholder 1998\] to identify simplex noun phrases and match those that share the same head.</Paragraph> <Paragraph position="2"> * WordNet synonyms. WordNet \[Miller et al. 1990\] provides sense information, placing words in sets of synonyms (synsets).
We match words that appear in the same synset.</Paragraph> <Paragraph position="3"> Variations on this feature restrict the words considered to a specific part-of-speech class.</Paragraph> <Paragraph position="4"> * Common semantic classes for verbs.</Paragraph> <Paragraph position="5"> Levin's \[1993\] semantic classes for verbs have been found to be useful for determining document type and text similarity \[Klavans and Kan 1998\]. We match two verbs that share the same semantic class.</Paragraph> <Paragraph position="6"> * Shared proper nouns. Proper nouns are identified using the ALEMBIC tool set \[Aberdeen et al. 1995\]. Variations on proper noun matching include restricting the proper noun type to a person, place, or an organization (these subcategories are also extracted with ALEMBIC's named entity finder).</Paragraph> <Paragraph position="7"> In order to normalize for text length and frequency effects, we experimented with two types of optional normalization of feature values. The first is for text length (measured in words), where each feature value is normalized by the size of the textual units in the pair. Thus, for a pair of textual units A and B, the feature values are divided by:</Paragraph> <Paragraph position="9"> This operation removes potential bias in favor of longer text units.</Paragraph> <Paragraph position="10"> The second type of normalization we examined was based on the relative frequency of occurrence of each primitive. This is motivated by the fact that infrequently matching primitive elements are likely to have a higher impact on similarity than primitives which match more frequently. We perform this normalization in a manner similar to the IDF part of TF*IDF \[Salton 1989\]. Every primitive element is associated with a value which is the number of textual units in which the primitive appeared in the corpus. 
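This weighting mirrors the IDF component of TF*IDF \[Salton 1989\]: primitives that match in fewer textual units count for more. A minimal sketch of the weighting in equation (2) (the function names are illustrative, not from the paper):

```python
import math

def idf_weight(total_units, units_with_primitive):
    # Rarity weight in the spirit of equation (2): the log of the total
    # number of textual units over the number of units containing the
    # primitive.  A primitive appearing in every unit gets weight 0.
    return math.log(total_units / units_with_primitive)

def normalize_feature(raw_value, total_units, units_with_primitive):
    # Multiply the raw feature value by the rarity weight.
    return raw_value * idf_weight(total_units, units_with_primitive)
```

A primitive that matches in every textual unit thus contributes nothing, while a rare match is amplified.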
For a primitive element which compares single words, this is the number of textual units which contain that word in the corpus; for a noun phrase, this is the number of textual units that contain noun phrases that share the same head; and similarly for other primitive types.</Paragraph> <Paragraph position="11"> We multiply each feature's value by: log (Total number of textual units / Number of textual units containing this primitive) (2) Since each normalization is optional, there are four variations for each primitive feature.</Paragraph> </Section> <Section position="2" start_page="204" end_page="206" type="sub_section"> <SectionTitle> 4.2 Composite Features </SectionTitle> <Paragraph position="0"> In addition to the above primitive features that compare single items from each text unit, we use composite features which combine pairs of primitive features. Composite features are defined by placing different types of restrictions on the participating primitive features: (a) An OH-58 helicopter, carrying a crew of two, was on a routine training orientation when contact was lost at about 11:30 a.m. Saturday (9:30 p.m. EST Friday). (b) &quot;There were two people on board,&quot; said Bacon. &quot;We lost radar contact with the helicopter about 9:15 EST (0215 GMT).&quot; Figure 2: A composite feature over word primitives with a restriction on order would count the pair &quot;two&quot; and &quot;contact&quot; as a match because they occur with the same relative order in both textual units.</Paragraph> <Paragraph position="1"> (a) An OH-58 helicopter, carrying a crew of two, was on a routine training orientation when contact was lost at about 11:30 a.m. Saturday (9:30 p.m. EST Friday).</Paragraph> <Paragraph position="2"> (b) &quot;There were two people on board,&quot; said Bacon. &quot;We lost radar contact with the helicopter about 9:15 EST (0215 GMT).&quot; Figure 3: A composite feature over word primitives with a restriction on distance would match on the pair &quot;lost&quot; and &quot;contact&quot; because they occur within two words of each other in both textual units.</Paragraph> <Paragraph position="3"> (a) An OH-58 helicopter, carrying a crew of two, was on a routine training orientation when contact was lost at about 11:30 a.m. Saturday (9:30 p.m. EST Friday). (b) &quot;There were two people on board,&quot; said Bacon. &quot;We lost radar contact with the helicopter about 9:15 EST (0215 GMT).&quot; Figure 4: A composite feature with a restriction on primitive type: one primitive must be a matching simplex noun phrase (in this case, a helicopter), while the other primitive must be a matching verb (in this case, &quot;lost&quot;). The example shows a pair of textual units where this composite feature detects a valid match.</Paragraph> <Paragraph position="4"> * Ordering. Two pairs of primitive elements are required to have the same relative order in both textual units (see Figure 2).</Paragraph> <Paragraph position="5"> * Distance. Two pairs of primitive elements are required to occur within a certain distance in both textual units (see Figure 3).</Paragraph> <Paragraph position="6"> The maximum distance between the primitive elements can vary as an additional parameter. A distance of one matches rigid collocations whereas a distance of five captures related primitives within a region of the text unit \[Smeaton 1992; Smadja 1993\].</Paragraph> <Paragraph position="7"> * Primitive. Each element of the pair of primitive elements can be restricted to a specific primitive, allowing more expressiveness in the composite features.
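If each matched primitive is recorded as a pair of word offsets (its position in unit A and in unit B — a representation invented here for illustration, not specified in the paper), the ordering and distance restrictions reduce to simple predicates:

```python
def same_order(match1, match2):
    # Ordering restriction: the two matches appear in the same relative
    # order in both textual units.  Each match is (pos_in_A, pos_in_B).
    (a1, b1), (a2, b2) = match1, match2
    return (a1 - a2) * (b1 - b2) > 0

def within_distance(match1, match2, max_dist):
    # Distance restriction: the two matches occur within max_dist words
    # of each other in both units (1 for rigid collocations, 5 for a
    # looser region, per the text).
    (a1, b1), (a2, b2) = match1, match2
    return abs(a1 - a2) <= max_dist and abs(b1 - b2) <= max_dist
```

In the Figure 2/3 example, "lost" and "contact" might be matched at hypothetical offsets (13, 11) and (15, 12): same relative order in both units, and within two words of each other in both. Either predicate can then gate whether a candidate pair of primitive matches counts toward the composite feature.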
For example, we can restrict one of the primitive features to be a simplex noun phrase and the other to be a verb; then, two noun phrases, one from each text unit, must match according to the rule for matching simplex noun phrases (i.e., sharing the same head), and two verbs must match according to the rule for verbs (i.e., sharing the same semantic class); see Figure 4. This particular combination loosely approximates grammatical relations, e.g., matching subject-verb pairs.</Paragraph> <Paragraph position="8"> 1 Verbs can also be matched by the first (and more restrictive) rule of Section 4.1, namely requiring that their stemmed forms be identical.</Paragraph> <Paragraph position="9"> Since these restrictions can be combined, many different composite features can be defined, although our empirical results indicate that the most successful tend to include a distance constraint. The more restrictions we put on a composite feature, the fewer times it occurs in the corpus; however, some of the more restrictive features are among the most effective in determining similarity. Hence, there is a balance between the discriminatory power of these features and their applicability to a large number of cases.</Paragraph> <Paragraph position="10"> Composite features are normalized as primitive features are (i.e., for text unit length and for frequency of occurrence). This type of normalization also uses equation (2) but averages the normalization values of each primitive in the composite feature.</Paragraph> </Section> <Section position="3" start_page="206" end_page="206" type="sub_section"> <SectionTitle> 4.3 Learning a Classifier </SectionTitle> <Paragraph position="0"> For each pair of text units, we compute a vector of primitive and composite feature values.</Paragraph> <Paragraph position="1"> To determine whether the units match overall, we employ a machine learning algorithm, RIPPER \[Cohen 1996\], a widely used and effective rule induction system.
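RIPPER itself induces symbolic classification rules; as a stand-in sketch (the linear scorer and its weights are hypothetical, not the paper's model), the effect of a loss ratio can be mimicked by raising the decision threshold for the positive "similar" class:

```python
import math

def predict_similar(features, weights, bias=0.0, loss_ratio=1.0):
    # Hypothetical linear scorer standing in for the learned classifier.
    # loss_ratio is the cost of a false positive relative to a false
    # negative; raising it demands more evidence before predicting
    # "similar", trading recall for precision.
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return score >= math.log(loss_ratio)
```

With loss_ratio = 1 the threshold is 0 (the symmetric case); larger values make positive predictions rarer, which matters when the positive class is itself rare.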
RIPPER is trained over a corpus of manually marked pairs of units; we discuss the specifics of our corpus and of the annotation process in the next section. We experiment with varying RIPPER's loss ratio, which measures the cost of a false positive relative to that of a false negative (where we view &quot;similar&quot; as the positive class), and thus controls the relative weight of precision versus recall. This is an important step in dealing with the sparse data problem; most text units are not similar, given our restrictive definition, and thus positive instances are rare.</Paragraph> </Section> </Section> </Paper>