File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-2612_metho.xml
Size: 28,709 bytes
Last Modified: 2025-10-06 14:09:22
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2612"> <Title>FBI Uniform Crime Reporting: Data Collection Guidelines</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Problem Statement, Motivation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Same Word, Many Different Definitions </SectionTitle> <Paragraph position="0"> What is the meaning of words and phrases? What concepts do they denote? Different sources of definitional knowledge, including dictionaries, encyclopedias, various texts, and people sharing their personal beliefs, define common words such as &quot;document&quot; and less common such as &quot;virus&quot; quite differently. Their definitions differ significantly in terms of length, properties (dimensions of information), their significance, levels of specificity, the number of different senses; see Tables 1-5.</Paragraph> <Paragraph position="1"> esp. of an official or legal nature.</Paragraph> <Paragraph position="2"> 2. qual Archaic. evidence; proof.</Paragraph> <Paragraph position="3"> SourceD2 1. anything printed, written, etc., relied upon to record or prove something 2. anything serving as proof SourceD7 1. writing that provides information (especially information of an official nature) 2. anything serving as a representation of a person's thinking by means of symbolic marks 3. a written account of ownership or obligation 4. (computer science) a computer file that contains text (and possibly formatting instructions) using 7-bit ASCII characters one human being by another. Also, any killing done while committing some other felony, as rape or robbery. SourceP3 1. Killing someone without justifications defined by society.</Paragraph> <Paragraph position="4"> SourceP4 1. The act of killing a living being is called murder. This is a crime and is against the ethics of human life. SourceP5 1. 
Killing a human.</Paragraph> <Paragraph position="5"> SourceA1 1. The willful (nonnegligent) killing of one human being by another.</Paragraph> <Paragraph position="6"> As with any information and knowledge, the reasons for these differences include incompleteness and lack of knowledge; errors, lies, and misinformation; subjectivity; and specific processing needs that make certain characteristics and details relevant and important.</Paragraph> <Paragraph position="7"> Additionally, such large differences exist because natural languages appear to be inherently ambiguous and context-dependent. Roughly, different sources give different definitions because they consider different contexts. A further complication is that words and phrases of natural language change their meanings over time. There are also regional differences.</Paragraph> <Paragraph position="8"> Table 3. STUDENT according to different sources SourceD1 1. a person following a course of study, as in a school, college, university, etc.</Paragraph> <Paragraph position="9"> 2. a person who makes a thorough study of a subject 3. a person who likes to study SourceD2 1. a person who studies or investigates 2. a person who is enrolled for study at a school, college, etc.</Paragraph> <Paragraph position="10"> SourceD5 1. One who is enrolled or attends classes at a school, college, or university.</Paragraph> <Paragraph position="11"> 2. a. One who studies something. b. An attentive observer.</Paragraph> <Paragraph position="12"> SourceD7 1. a learner who is enrolled in an educational institution 2. a learned person (especially in the humanities); someone who by long study has gained mastery in one or more disciplines</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.2 Important to Know Right Definitions </SectionTitle> <Paragraph position="0"> This situation creates a major difficulty for designers of general-purpose natural language processing (NLP) systems. 
An in-depth interpretation of natural language requires a component providing lexical knowledge, such as a dictionary or a knowledge base. Text processing applications involving classification, summarization, or question answering may produce very different results depending on which definition is used.</Paragraph> <Paragraph position="1"> a single nucleic acid surrounded by a protein coat and capable of replication only within the cells of animals and plants; many are pathogenic.</Paragraph> <Paragraph position="2"> 2. a disease caused by a virus.</Paragraph> <Paragraph position="3"> 3. any corrupting or infecting influence SourceD2 1. orig., venom, as of a snake 2. a. same as FILTERABLE VIRUS; specif., any of a group of ultramicroscopic or submicroscopic infective agents that cause various diseases in animals, as measles, mumps, etc., or in plants, as mosaic diseases; viruses are capable of multiplying only in connection with living cells and are regarded both as living organisms and as complex proteins sometimes involving nucleic acid, enzymes, etc. b. a disease caused by a virus 3. anything that corrupts or poisons the mind or character; evil or harmful influence 4. something that poisons the mind or soul 5. a computer program usually hidden within another seemingly innocuous program that produces copies of itself and inserts them into other programs and that usually performs a malicious action (as destroying data) SourceD3 1. a very small organism, smaller than a bacterium, which causes disease in humans, animals and plants 2. Virus also means a disease caused by a virus.</Paragraph> <Paragraph position="4"> 3. a hidden instruction in a computer program which is intended to introduce faults into a computer system and in so doing destroy information stored in it SourceC2 1. 
Viruses are extremely small infectious substances (much smaller than bacteria).</Paragraph> <Paragraph position="5"> For example, the property of 'liking to study' and the property of 'being enrolled at school' can classify individuals as &quot;students&quot; completely differently; see the definitions of &quot;student&quot; according to SourceD1 and SourceD5 in Table 3.</Paragraph> <Paragraph position="6"> A person who understands &quot;murder&quot; as 'killing a human', see SourceP5 in Table 2, may develop a false sense of security when reading FBI statistics compiled with a different, more restrictive definition of &quot;murder&quot; which excludes certain types of killing a human from being classified as &quot;murder&quot;; the FBI is SourceA1 in Table 2.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.3 Many Competing Sources </SectionTitle> <Paragraph position="0"> The question arises as to which of these many definitions is the right one, the most correct and complete, and which of the many available sources should be used for building a lexical knowledge component of an NLP system, be it a dictionary or a knowledge base.</Paragraph> <Paragraph position="1"> Many NLP researchers and practitioners have built and continue to build their own dictionaries/knowledge bases, which tends to be a very long and costly effort requiring serious resources. Another problem is that self-developed resources are virtually always geared toward specific applications and types of textual data processed, which contributes to the nonscalability of NLP systems.</Paragraph> <Paragraph position="2"> Many researchers utilize existing sources. WordNet (Fellbaum, 1998) is a wonderful and free-of-charge resource designed specifically for the needs of the computational linguistics (CL) community and the dictionary of choice for many NLP systems (Voorhees and Buckland, 2002). 
It is not, however, the only, the best, or the most comprehensive source. There are hundreds of other sources of lexical definitional knowledge available at, among others, OneLook.com</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> and YourDictionary.com Dictionary Search websites. </SectionTitle> <Paragraph position="0"> A promising recent approach pursued by a number of NLP and CL researchers is developing knowledge acquisition and learning methods to automatically create dictionaries and knowledge bases or augment the existing ones with system-acquired knowledge from corpora of texts (Iwanska et al., 1999, 2000a), (Harabagiu and Moldovan, 2000), (Rapaport and Kibby, 2002), (Reiter and Robertson, 2003), (Thompson and Mooney, 2003).</Paragraph> </Section> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1.4 Need for Comparison, Integration </SectionTitle> <Paragraph position="0"> Given the variety of sources and definitions for virtually all words and phrases, a comparison mechanism is needed to address the question of which source is the best, the most complete and correct, and which definition(s) to use, and, if multiple definitions are valid, to identify their similarities and differences.</Paragraph> <Paragraph position="1"> We developed a computational mechanism to automatically compare and, in some cases, integrate knowledge from multiple sources. Given two definitions of a word or phrase, our system computes a quantitative measure of distance between them based on qualitative relations between these definitions: PARTIAL-OVERLAP, MORE-SPECIFIC / MORE-GENERAL, DISJOINT. It also highlights their similarities and differences.</Paragraph> <Paragraph position="2"> The computed comparison is used to reach the integrate-or-not decision. 
If integration is deemed appropriate, the system computes integrated definitions.</Paragraph> <Paragraph position="3"> In our NLP system, we address incompleteness and changes in meaning through integration of our handcrafted, modest-size dictionary with definitions from reliable sources. Our primary sources include existing &quot;respectable&quot; dictionaries, see Table 5, and knowledge acquired automatically by our system from corpora of &quot;respectable&quot; texts.</Paragraph> <Paragraph position="4"> Automatic knowledge acquisition methods are particularly useful for acquiring and updating phrasal definitional knowledge. For example, none of the hundreds of dictionaries mentioned above defines phrases such as &quot;safe environment&quot; or &quot;very fast actions&quot;, both of which were learned by our system (Iwanska et al., 1999, 2000a).</Paragraph> <Paragraph position="5"> Additionally, knowledge acquired from recent texts allows our system to update definitions that have changed with time. For example, the fourth definition of &quot;document&quot; given by SourceD7 (WordNet), probably about ten years ago, is now too restrictive. Currently, any character, not just a 7-bit ASCII character, can be used in a document. Knowledge acquired by our system allowed us to correctly generalize this definition to account for this change.</Paragraph> <Paragraph position="6"> The capability of comparing and integrating lexical knowledge results in improved performance of our NLP system. For example, in question answering, new questions can be answered, the correctness of some answers is improved, and some questions can be answered more completely. In tasks involving classification, groupings arrived at via different definitions may be compared and predicted.</Paragraph> <Paragraph position="7"> The rest of the paper is organized as follows: Sect. 2 provides a high-level discussion of our meaning and knowledge-level representation of text; Sect. 
3 gives algorithmic details of our comparison and integration approach; it also provides a number of examples; Sect.</Paragraph> <Paragraph position="8"> 4 and 5 discuss reliable and unreliable sources and more details about our integration mechanism.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 NL-Motivated Representation of Text </SectionTitle> <Paragraph position="0"> We discuss briefly our natural language-motivated representation of text. Further details, including question answering, representation and reasoning with text conveying spatio-temporal and probabilistic information and knowledge, can be found in (Iwanska, 1993), (Iwanska, 1996), (Iwanska, 2000b).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Text as Sets of Type Equations </SectionTitle> <Paragraph position="0"> We represent text by natural language-motivated type equations with Boolean, set and interval-theoretic semantics of the following form</Paragraph> <Paragraph position="2"> where P's are properties corresponding to text fragments such as noun phrases and verb phrases.</Paragraph> <Paragraph position="3"> Each property is a term, a record-like, graph-like, underspecified structure that consists of two elements: 1. head, a type symbol, and 2. body, a possibly empty list of attribute-value pairs of the form attribute => value, where attributes are symbols and values are single terms or sets of terms. For example, the sentence &quot;Viruses are extremely small infectious substances&quot; is represented by the type equation virus == substance(size => small(degree => extremely), infect => infectious)</Paragraph> <Paragraph position="5"> whose right-hand side contains one property, a term with &quot;substance&quot; as its head and two attributes: 1. the attribute &quot;size&quot; with the value small(degree => extremely), which itself is a term with the type &quot;small&quot; as its head and one attribute &quot;degree&quot; with the value &quot;extremely&quot;. 2. 
the attribute &quot;infect&quot; with the value &quot;infectious&quot; which is a basic type.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Boolean, Set and Interval-Theoretic Semantics Motivated by Natural Language </SectionTitle> <Paragraph position="0"> Semantically, terms are subtypes of their head types. For example, the above term represents the subset of things of the type &quot;substance&quot; for which the attribute &quot;size&quot; has the value &quot;extremely small&quot; and for which the function &quot;infect&quot; yields the value &quot;infectious&quot;. The Boolean operations of MEET, JOIN, and COMPLEMENT simulate conjunction, disjunction and negation in natural language. They take terms as arguments and compute conjunctive, disjunctive, and complementary terms with the set-intersection, set-union, and set-complement semantics.</Paragraph> <Paragraph position="1"> Efficient computation of arbitrary Boolean expressions allows the system to compute a number of semantic relations among terms, including EQUAL, reflecting set identity; ENTAILMENT (and SUBSUMPTION, its dual), reflecting set inclusion; PARTIAL-OVERLAP, reflecting non-empty set intersection; and DISJOINT, reflecting empty set intersection. These relations allow the system to compute consequences of knowledge expressed by text, and therefore to compute answers to questions asked of the knowledge base created as the result of processing input texts, and to update the system's knowledge base.</Paragraph> <Paragraph position="2"> Knowledge bases with such type equations are used bi-directionally: for answering questions about the properties of entities and concepts in the left-hand sides, and for matching particular properties against the right-hand side properties of entities and concepts that the system knows about. 
We use these capabilities to compare and to integrate properties in different concept definitions.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Algorithmic Details </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Input </SectionTitle> <Paragraph position="0"> 1. Concept C, a word or phrase. For example, we may be concerned with the meaning (concept definition) of the word &quot;virus&quot; or the phrase &quot;very fast actions&quot;. 2. Two knowledge sources Source1 and Source2. Our sources of definitional knowledge include dictionaries, encyclopedias, personal beliefs obtained via knowledge engineering methods, and knowledge automatically acquired by our NLP system from corpora of texts; see Table 5.</Paragraph> <Paragraph position="1"> 3. Concept definitions according to both sources</Paragraph> <Paragraph position="3"> Each definition is text, that is, some number of sentences or phrases such as noun phrases or verb phrases. For example, if the word is &quot;virus&quot; and we consider SourceD1 as Source1, and SourceD2 as Source2, then N=3 and M=6, i.e., we have three definitions of &quot;virus&quot; from Source1 and six from Source2. 
These definitions correspond to different senses; note that SourceD2 distinguishes two senses 2a., 2b.; see Tables 4 and 5.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Steps </SectionTitle> <Paragraph position="0"> Step 1 Compute representations of the word or phrase C and of each of its textual definitions T.</Paragraph> <Paragraph position="2"> Step 2 For each pair of definitions from both sources, compute the qualitative relation R between each pair of their properties P.</Paragraph> <Paragraph position="4"> Step 3 For each pair of definitions from both sources, compute the numeric measure of closeness D between the two definitions.</Paragraph> <Paragraph position="5"> This measure, whose motivation is similar to that of (Resnik, 1999), is a number between 0 and 1 computed from the qualitative relations R among the properties in both definitions and from the proportion of relations indicating closeness; EQUAL corresponds to 1, the smallest distance, SMALLER (MORE-SPECIFIC) and LARGER (MORE-GENERAL) to 0.8, PARTIAL-OVERLAP to 0.6, and DISJOINT to 0, the largest distance.</Paragraph> <Paragraph position="6"> Step 4 Compute the alignment of definitions based on the metric D computed for each pair. This alignment shows which definitions from both sources resemble each other most closely. For the definitions of &quot;virus&quot; according to SourceD1 and SourceD2, see Table 4, this alignment is ((1, 2a), (2, 2b), (3, 3), (-, 1), (-, 4), (-, 5)). Step 5 For each pair of aligned definitions, decide whether to integrate and choose the integration mode based on the reliability of the sources and on the value of D.</Paragraph> <Paragraph position="7"> Step 5a Compute integrated definition. 
This integration, illustrated by examples in Sections 4 and 5, involves computing the Boolean operations of meet (conjunction), join (disjunction), and complement (negation) on the properties in the right-hand side of the definitions.</Paragraph> <Paragraph position="8"> Step 5b Generate English text for the integrated definition.</Paragraph> <Paragraph position="9"> Step 5c Update the system dictionary/knowledge base with the integrated definition.</Paragraph> <Paragraph position="10"> 3.3 Output 1. Updated system dictionary/knowledge base incorporating knowledge from both sources.</Paragraph> <Paragraph position="11"> 2. Alignment of definitions 3. Highlights of similarities and differences between pairs of definitions.</Paragraph> <Paragraph position="12"> 4 Reliable and Unreliable Sources Depending on whether sources are reliable or not (in general or with respect to a specific piece of information or knowledge), we use different integration operations. If both are reliable, we integrate most aggressively and the resulting integrated piece fully reflects all that both sources provided. If one source may not be reliable, a conservative integration is performed. Finally, if a source is known or suspected to be unreliable, we first negate its information and then fully combine it with everything provided by the reliable source.</Paragraph> <Paragraph position="13"> Consider temporal information about the occurrence of an event provided by two sources, different people recalling the same event.</Paragraph> <Paragraph position="14"> Source1: &quot;It took place in 1992, April or May&quot; Source2: &quot;It did not happen in early May&quot; Depending on whether these sources are considered reliable or not, we combine their information differently, which results in three possible pieces of integrated information about the time when the event took place. 
Information provided by the sources is translated into the following terms</Paragraph> <Paragraph position="16"/> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Both Sources Reliable </SectionTitle> <Paragraph position="0"> If both sources are considered reliable, we use the meet operation to compute an integrated piece of information or knowledge. This operation, a conjunction with inheritance, fully incorporates all information provided by both sources. For the above dates, an integrated term is computed</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 One Source Possibly Unreliable </SectionTitle> <Paragraph position="0"> If one source may not be reliable, but it is not known which one, we use the join operation to integrate. This operation, a disjunction, conservatively incorporates information provided by both sources. For the above dates, the integrated term cannot be simplified: its two elements are partially overlapping because the sources provide different aspects of the temporal information.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 One Source Reliable, One Unreliable </SectionTitle> <Paragraph position="0"> If one source is considered unreliable, e.g., it is known or suspected to have lied or to be ignorant, we use the complement operation to negate its information. The rationale is that if a piece of information or knowledge is incorrect, then the actual correct information and knowledge, whatever it may be, is consistent with the negation of what the source provided. The complement operation allows us to capture this. We then integrate both terms via the meet operation. 
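The three integration modes of Sections 4.1 to 4.3 can be illustrated on the date example with a minimal Python sketch. It is not the system's implementation: it assumes that temporal terms can be approximated by finite sets of days, that the universe of discourse is the years 1991 to 1993, and that "early May" means May 1 to May 10. All names in the code (UNIVERSE, meet, join, complement) are hypothetical, and Source2 is arbitrarily chosen as the unreliable source in the last mode.

```python
from datetime import date, timedelta

# Assumption: time is modelled as a finite set of days. The universe spans
# 1991 to 1993 so that "early May" of other years is also representable.
FIRST, LAST = date(1991, 1, 1), date(1993, 12, 31)
UNIVERSE = {FIRST + timedelta(days=i) for i in range((LAST - FIRST).days + 1)}


def meet(a, b):
    """Conjunction: set intersection."""
    return a.intersection(b)


def join(a, b):
    """Disjunction: set union."""
    return a.union(b)


def complement(a):
    """Negation: set complement relative to UNIVERSE."""
    return UNIVERSE.difference(a)


# Source1: "It took place in 1992, April or May"
source1 = {d for d in UNIVERSE if d.year == 1992 and d.month in (4, 5)}

# Source2: "It did not happen in early May"
# Assumption: "early May" means May 1 to May 10; the text gives no cut-off.
early_may = {d for d in UNIVERSE if d.month == 5 and d.day in range(1, 11)}
source2 = complement(early_may)

both_reliable = meet(source1, source2)                   # Section 4.1
one_possibly_unreliable = join(source1, source2)         # Section 4.2
source2_unreliable = meet(source1, complement(source2))  # Section 4.3

print(len(both_reliable))            # April 1992 plus May 11 to 31, 1992
print(len(one_possibly_unreliable))  # all days except early May 1991 and 1993
print(min(source2_unreliable), max(source2_unreliable))
```

Under these assumptions, the meet keeps April 1992 and the rest of May 1992, the join excludes only the early May days of the other years, and negating Source2 before the meet narrows the answer to early May 1992.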
For the above dates, the system computes an integrated</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Partially Overlapping Concepts </SectionTitle> <Paragraph position="0"> Definitions from different sources frequently denote partially overlapping concepts. Overlap exists because properties are described at different levels of specificity and because some properties are stated only by one source. If both sources are reliable, we mostly use the most aggressive mode of integration, which combines all knowledge provided by both sources. In the integrated definition, some properties become more specialized (more informative) and some other new properties are added.</Paragraph> <Paragraph position="1"> An example is a dictionary definition which we update with knowledge acquired from texts. As shown in Table 4, SourceD3 defines &quot;virus&quot; as &quot;a very small organism, smaller than a bacterium, which causes disease in humans, animals and plants&quot;, and SourceC1 as &quot;extremely small infectious substances (much smaller than bacteria)&quot;.</Paragraph> <Paragraph position="2"> The integration of the first definition with the second produces an integrated definition &quot;an extremely small, infectious organism (substance), much smaller than a bacterium, which causes disease in humans, animals and plants&quot;. Two size-related properties get more specialized: &quot;very small&quot; becomes &quot;extremely small&quot;, and &quot;smaller&quot; becomes &quot;much smaller&quot;. These integrated properties contain strictly more information than (entail) the corresponding properties in the old definition. The new property added is &quot;infectious&quot;. 
This is accomplished as follows.</Paragraph> <Paragraph position="3"> First, the representation of definitions is computed = smaller(quantity => much, Then, relations R for each pair of properties in the right-hand side of the equations are computed via the meet operation.</Paragraph> <Paragraph position="4"> The relations R for the other pairs of properties are DISJOINT because the meet operation yields terms corresponding to the empty set. D = 2/3 and in the COMBINE-ALL integration mode, the integrated type equation has three properties: the integrated properties This equation then gets translated into the English phrase &quot;an extremely small, infectious organism (substance), much smaller than a bacterium, which causes disease in humans, animals and plants&quot;, in which the order of the properties mentioned follows the order in the original definition.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Concepts in MORE-GENERAL Relation </SectionTitle> <Paragraph position="0"> Definitions from different sources may denote concepts in a MORE-GENERAL (LARGER) relation. For example, as the following equations reveal, the SourceP3 definition is strictly more general, i.e., denotes a larger set, than the definitions from SourceD3 and SourceA1.</Paragraph> <Paragraph position="1"> Such a relation may indicate that one source has a definition that is too general due to, for example, ignorance. It can also indicate that a source has a definition that is overly specific, i.e., not generalized enough. We do not have the means to decide automatically which is the case. In certain cases, we make somewhat arbitrary assumptions.</Paragraph> <Paragraph position="2"> For example, if two dictionary definitions are in a MORE-GENERAL relation, we integrate by keeping the most specific. 
Then, if the context requires certain properties at a given level of specificity, we generate shorter, more general definitions via our summarization/generalization mechanism. In the case of personal beliefs, unless a person is known to be an expert, we assume that sources such as dictionaries and texts are more correct and integrate accordingly.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Clashes Signal Need to Generalize, Correct </SectionTitle> <Paragraph position="0"> Clashes between information and knowledge from different sources indicate inconsistencies that need to be resolved. In our representation, inconsistencies are automatically detected when the meet operation generates a term corresponding to the empty set.</Paragraph> <Paragraph position="1"> Some clashes indicate the need to generalize, others reflect errors or deliberate misrepresentations that need to be corrected. We have a mechanism to identify clashes, but we do not have an automatic way to decide what to do about them. Each time the system generates a clash, a human has to decide how to resolve it.</Paragraph> <Paragraph position="2"> This situation is reminiscent of expert systems and knowledge-based systems in that the decision as to which piece of knowledge or which expert is correct does not appear to have a general solution and involves rather arbitrary assumptions and trust.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.4 Integrating Two Word Senses Into One </SectionTitle> <Paragraph position="0"> The same source, e.g., a dictionary, can be treated as if it were two sources, which allows us to investigate similarities and differences between different senses of the same word or phrase. 
In some cases, similarities lead to integrating two senses into one, thus reducing the number of word senses.</Paragraph> <Paragraph position="1"> For example, the similarity between partially overlapping senses of &quot;virus&quot;, see Table 4 for definitions 3 and 4 from SourceD2, led to one combined sense. The original two senses &quot;anything that corrupts or poisons the mind or character&quot; and &quot;something that poisons the mind or soul&quot; are represented as follows: virus == [ corrupt, poison ](object => [ mind, character ]) . virus == poison(object => [ mind, soul ]) .</Paragraph> <Paragraph position="2"> The conservative join-based integration, combined with a machine learning-style inductive leap (adding or skipping some aspect in order to simplify and/or shorten the utterance), results in one combined word sense which corresponds to the first original sense.</Paragraph> </Section> </Section> </Paper>
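The comparison of the two senses of "virus" in Section 5.4 can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the paper: each term is expanded into a toy set of (action, object) pairs that stands in for its denotation, and the relation between two terms is read directly off these sets. The function name relation and the variables sense3 and sense4 are hypothetical; only the relation names and the per-relation scores come from Step 3 of Section 3.2. The sketch does not compute the overall measure D, because the text does not state how the per-pair scores are aggregated.

```python
from itertools import product

# Scores quoted in Step 3 of Section 3.2 (SMALLER and LARGER are listed
# here under their other names, MORE-SPECIFIC and MORE-GENERAL).
SCORES = {
    "EQUAL": 1.0,
    "MORE-SPECIFIC": 0.8,
    "MORE-GENERAL": 0.8,
    "PARTIAL-OVERLAP": 0.6,
    "DISJOINT": 0.0,
}


def relation(a, b):
    """Classify the set relation between two denotations a and b."""
    if a == b:
        return "EQUAL"
    if a.isdisjoint(b):
        return "DISJOINT"
    if a.issubset(b):
        return "MORE-SPECIFIC"  # a denotes a smaller set than b
    if b.issubset(a):
        return "MORE-GENERAL"   # a denotes a larger set than b
    return "PARTIAL-OVERLAP"


# Assumption: a toy denotation built from every (action, object) pair that
# the term mentions; the real system operates on terms, not on such pairs.
sense3 = set(product(["corrupt", "poison"], ["mind", "character"]))
sense4 = set(product(["poison"], ["mind", "soul"]))

rel = relation(sense3, sense4)
print(rel, SCORES[rel])                        # relation and its score
print(sorted(sense3.intersection(sense4)))     # what the senses share (meet)
print(len(sense3.union(sense4)))               # size of the join
```

Under this toy reading the two senses share only the pair (poison, mind) and neither contains the other, which matches the description of them as partially overlapping; the join keeps every pair from both senses before any inductive simplification is applied.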