<?xml version="1.0" standalone="yes"?>
<Paper uid="M98-1006">
  <Title>Using Collocation Statistics in Information Extraction</Title>
  <Section position="2" start_page="0" end_page="3" type="metho">
    <SectionTitle>
COLLOCATION DATABASE
</SectionTitle>
    <Paragraph position="0"> We de#0Cne a collocation to be a dependency triple that consists of three #0Celds: #28word, relation, relative#29 where the word #0Celd is a word in a sentence, the relative #0Celd can either be the modi#0Cee or a modi#0Cer of word, and the relation #0Celd speci#0Ces the type of the relationship between word and relative as well as their parts of speech.</Paragraph>
    <Paragraph position="1"> For example, the dependency triples extracted from the sentence #5CI have a brown dog&amp;quot; are:  The identi#0Cers for the dependency types are explained in Table 1.</Paragraph>
    <Paragraph position="2">  Label Relationship between: N:det:D a noun and its determiner N:jnab:A a noun and its adjectival modi#0Cer N:nn:N a noun and its nominal modi#0Cer V:comp1:N averb and its noun object V:subj:N averb and its subject V:jvab:A averb and its adverbial modi#0Cer We used MINIPAR, a descendent of PRINCIPAR #5B2#5D, to parse a text corpus that is made up of 55-million-word Wall Street Journal and 45-million-word San Jose Mercury. Two steps were taken to reduce the number of errors in the parsed corpus. Firstly, only sentences with no more than 25 words are fed into the parser. Secondly, only complete parses are included in the parsed corpus. The 100 million word text corpus is parsed in about 72 hours on a Pentium 200 with 80MB memory. There are about 22 million words in the parse trees.</Paragraph>
    <Paragraph position="3"> Figure 1 shows an example entry in the resulting collocation database. Eachentry contains of all the dependency triples that have the same word #0Celd. The dependency triples in an entry are sorted #0Crst in the order of the part of speech of their word #0Celds, then the relation #0Celd, and then the relative #0Celd.</Paragraph>
    <Paragraph position="4"> The symbols used in Figure #281#29 are explained as follows. Let X be a multiset. The symbol kXk stands for the number of elements in X and jXj stands for the number of distinct elements in X.For example, a. k#28review, V:comp1:N, acquisition#29k is the number of times #5Cacquisition&amp;quot; is used as the object of the verb #5Creview&amp;quot;.</Paragraph>
    <Paragraph position="5"> b. k#28review, *, *#29k is the number of dependency triples in which the word #0Celd is #5Creview&amp;quot; #28which can be a noun or a verb#29.</Paragraph>
    <Paragraph position="6"> c. k#28review, V:jvab:A, *#29k is the number of times #5B v review#5D is pre-modi#0Ced by an adverb.</Paragraph>
    <Paragraph position="7"> d. j#28review, V:jvab:A, *#29j is the number of distinct adverbs that were used as a pre-modi#0Cer of #5B v review#5D.</Paragraph>
    <Paragraph position="8"> e. k#28*, *, *#29k is the total number of dependency triples, whichistwice the number of dependency relationships in the parsed corpus.</Paragraph>
    <Paragraph position="9"> f. k#28review, N#29k is the number of times the word #5Creview&amp;quot; is used as a noun. g. k#28*, N#29k is the total number of occurrences of nouns.</Paragraph>
    <Paragraph position="10"> h. j#28*, N#29j is the total number of distinct nouns that</Paragraph>
    <Paragraph position="12"> &amp;quot;review&amp;quot; has been used as the objects of these verbs in the corpus these nouns were used as a prenominal modifier of &amp;quot;review&amp;quot; ||(review, N)||  i. k#28review, *#29k is the total numberof occurrences of the word #5Creview&amp;quot; #28used as any category#29 in the parsed corpus.</Paragraph>
  </Section>
  <Section position="3" start_page="3" end_page="5" type="metho">
    <SectionTitle>
NAMED ENTITY RECOGNITION
</SectionTitle>
    <Paragraph position="0"> Our named entity recognizer is a #0Cnite-state pattern matcher, whichwas developed as part University of Manitoba MUC-6 e#0Bort. The pattern matcher has access to both lexical items and surface strings in the input text. In MUC-7, we extended the earlier system in twoways: #0F We extracted recognition rules automatically from the collocation database to augment the manually coded pattern rules.</Paragraph>
    <Paragraph position="1"> #0F We treated the collocational context of words in the input texts as features and used a Naive-Bayes classi#0Cer to categorized unknown proper names, which are then inserted into the systems lexicon.</Paragraph>
    <Paragraph position="2"> A collocational context of a proper name is often a good indicator of its classi#0Ccation. For example, in the 22-million-word corpus, there are 33 instances where a proper noun is used as a prenominal modi#0Cer of #5Cmanaging director&amp;quot;. In 26 of the 33 instances, the proper name was classi#0Ced as an organization. In the remaining 7 instances, the proper name was not classi#0Ced. Therefore, if an unknown proper name is a prenominal modi#0Cer of #5Cmanaging director&amp;quot;, it is likely to refer to an organization. We extracted 3623 such contexts in which the frequency of one type of proper names is much greater #28as de#0Cned by a rather arbitrary threshold#29 than the frequencies of other types of proper names. If a proper name occurs in one of these contexts, we can then classify it accordingly. This use of the collocation database is equivalent to automatic generation of classi#0Ccation rules. In fact, some of the collocational contexts are equivalent to pattern-matching rules that were manually coded in the system.</Paragraph>
    <Paragraph position="3"> There are only a small number of collocational contexts in which the classi#0Ccation of a proper name can be reliablydetermined. In most cases, a clear decision cannot be reached based on a single collocational context. For example, among 1504 objects of #5Cconvince&amp;quot;, 49 of them were classi#0Ced as organizations, and 457 of them were classi#0Ced as persons. This suggests that if a proper name is used as the object of #5Cconvince&amp;quot;, it is likely that the name refers to a person. However, there is also signi#0Ccant probability that the name refers to an organization. Instead of making the decision based on this single piece of evidence, we collect from the input texts all the collocational contexts in which an unknown proper names occurred. We then classify the the proper name with a naive Bayes classi#0Cer, using the the set of collocation contexts as features.</Paragraph>
    <Paragraph position="4"> The naive Bayes classi#0Cer uses a table to store the frequencies of proper name classes in collocational contexts. Sample entries of the frequency table are shown in Table 2. Each row in the table represents a collocation feature. The #0Crst column is a collocation feature. Words with this feature have been observed to occur at position X in the second column. The third to #0Cfth columns contain the frequencies of di#0Berent proper name classes.</Paragraph>
    <Paragraph position="5"> Let C be a class of proper name #28C is one of LOC, ORG, or PER#29. Let F</Paragraph>
    <Paragraph position="7"> be a collocation feature. Classi#0Ccation decision is made by #0Cnd the class C that maximizes</Paragraph>
    <Paragraph position="9"/>
    <Paragraph position="11"> are the features of an unknown proper name. The probability P#28F</Paragraph>
    <Paragraph position="13"> as the parameters, where CF is the set of collocation features:</Paragraph>
    <Paragraph position="15"> ;Ckdenotes the frequency of words that belong to C in the context represented by f.</Paragraph>
    <Paragraph position="16"> Example: The walkthrough article contains several occurrences of the word #5CXichang&amp;quot; whichis not found in our lexicon. The parser extracted the following set of collocation contexts from the formal testing corpus:  1. #5Cthe Xichang base&amp;quot;, whereXichangisusedasthe prenominalmodi#0Cerof#5Cbase&amp;quot; #28base|N:nn:N#29; 2. #5Cthe Xichang site&amp;quot;, where Xichang isusedas the prenominalmodi#0Cerof #5Csite&amp;quot; #28site|N:nn:N#29; 3. #5Cthe site in Xichang&amp;quot;, from whichtwo features are extracted: #0F the object of #5Cin&amp;quot; #28in|P:pcomp:N#29; #0F indirect modi#0Cer of #5Csite&amp;quot; via the preposition #5Cin&amp;quot; #28site|N:pnp-in:N#29.  The frequencies of the features are shown in Table 3. These features allowed the naive Bayes classi#0Cer to correctly classify #5CXichang&amp;quot; as a locale.</Paragraph>
    <Paragraph position="17"> Automatically acquiring lexical information on the #0Dy is an double edged sword. On the one hand, it allows classi#0Ccation of proper names that would otherwise be unclassi#0Ced. On the other hand, since there is no human con#0Crmation, the correctness of the automatically acquired lexical items cannot be guaranteed. When incorrect information is entered into the lexicon, a single error may propagate to many places. For example, during the development of our system, a combination</Paragraph>
    <Paragraph position="19"> of parser errors and the naiveBayes classi#0Ccation caused the word #5CI&amp;quot; to be added into the lexicon as a personal name. During the second pass, 143 spurious personal names were generated.</Paragraph>
    <Paragraph position="20"> Our NE evaluation results are shown in Table 4. The #5Cpass1&amp;quot; results are obtained by manually coded patterns in conjunction with the classi#0Ccation rules automatically extracted from the collocation database. With the naive Bayes classi#0Ccation, the recall is boosted by 6 percent while the precision is decreased by 2#25 with an overall increase of F-measure by 2.67.</Paragraph>
  </Section>
  <Section position="4" start_page="5" end_page="9" type="metho">
    <SectionTitle>
COREFERENCE
</SectionTitle>
    <Paragraph position="0"> Our coreference recognition subsystem used the same constraint-based model as our MUC-6 system.</Paragraph>
    <Paragraph position="1"> This model consists of an integrator and a set of independent modules, suchassyntactic patterns #28e.g., copula construction and appositive#29, string matching, bindingtheory, and centering heuristics.</Paragraph>
    <Paragraph position="2"> Each module proposes weighted assertions to the integrator. There are twotypes of assertions. An equality assertion states that two noun phrases have the same referent. An inequality assertion states that two noun phrases must not have the same referent. The modules are allowed to freely contradict one another, or even themselves. The integrator use the weights associated with the assertions to resolve the con#0Dicts. A discourse model is constructed incrementally by the sequence of assertions that are sorted in descending order of their weights. When an assertion is consistent with the current model, the model is modi#0Ced accordingly. Otherwise, the assertion is ignored and the model remains the same.</Paragraph>
    <Paragraph position="3"> One of the important factors to determine whether or not two noun phrases may refer to the same entity is their semantic compatibility. A personal pronoun must refer to a person. For example, the pronoun #5Cit&amp;quot; may refer to an organization, an artifact, but not a person. A #5Cplane&amp;quot; may refer to an aircraft. A #5Cdisaster&amp;quot; may refer to a crash. In MUC-6, we used the WordNet to  determine the semantic compatibility and similaritybetween two noun phrases. However, without the ability to determine the intended sense of a word in the input text, we had to say that all senses are possible.</Paragraph>
    <Paragraph position="4">  The problem with this approach is that the WordNet, likeany other general purpose lexical resource, aims at providing broad-coverage. Consequently, it includes many usages of words that are very rare in our domain of interest. For example, one of the 8 potential senses of #5Ccompany&amp;quot; in WordNet 1.5 is a #5Cvisitor#2Fvisitant&amp;quot;, whichisahyponym of #5Cperson&amp;quot;. This usage of the word practically never happens in newspaper articles. However, its existence prevents us to make assertions that personal pronouns like #5Cshe&amp;quot; cannot co-refer with #5Ccompany&amp;quot;. In MUC-7, we developed a word sense disambiguation #28WSD#29 module, which removes some of the implausible senses from the list of potential senses. It does not necessarily narrows down the possible senses of a word instance to a single one, however.</Paragraph>
    <Paragraph position="5"> Given a polysemous word w in the input text, we take the following steps to narrowdown the possibilities for its intended meaning:  1. Retrieve collocational contexts of w from the parse trees of the input text.</Paragraph>
    <Paragraph position="6"> 2. For each collocational context of w, retrieve its set of collocates, i.e., the set of words that occurred in the same collocational context. Take the union of all the sets of collocates of w. 3. Take the intersection of the union and the set of similar words of w which are extracted automatically with the collocational database #5B4#5D. We call the words in the intersection selectors.</Paragraph>
    <Paragraph position="7"> 4. Score the set of potential senses of w by computing the similarities between senses of w and  senses of the selectors in the WordNet #5B3#5D. Remove the senses of w that received a score less than 75#25 of the highest score.</Paragraph>
    <Paragraph position="8"> Example: consider the word #5C#0Cghter&amp;quot; in the following context in the walkthrough article: ... in the multibillion-dollar deals for #0Cghter jets.</Paragraph>
    <Paragraph position="9"> WordNet lists three senses of #5C#0Cghter&amp;quot;: #0F combatant, battler, disrupter #0F champion, hero, defender, protector #0F #0Cghter aircraft, attack aircraft The disambiguation of this word takes the following steps:  2. Retrievewords from the collocation database that were also used as the prenominal modi#0Cer of #5Cjet&amp;quot; #28shown in Table 5#29. Freq is the frequency of the word in the context, LogL is the log likelihood ratio between the word and the context #5B1#5D.</Paragraph>
    <Paragraph position="10"> 3. Retrieve the similar words of #5C#0Cghter&amp;quot; from an automatically generated thesaurus: jet 0.15; guerrilla 0.14; aircraft 0.12; rebel 0.11; bomber 0.11; soldier 0.11; troop 0.10; plane 0.10; missile 0.09; force 0.09; militia 0.09; helicopter 0.09; leader 0.08; civilian 0.07; faction 0.07; pilot 0.07; airplane 0.07; insurgent 0.07; commander 0.06; tank 0.06; airliner 0.05; militant 0.05; marine 0.05; transport 0.05; reconnaissance 0.05; prisoner 0.05; artillery0.05; army 0.05; stealth 0.05; victim 0.05; terrorist 0.05; weapon 0.04; rocket 0.04; resistance 0.04; rioter 0.04; gunboat 0.04; collaborator 0.04; assailant 0.04; thousand 0.04; gunman 0.04; sympathizer 0.04; radio 0.04; submarine 0.04; attacker 0.04; youth 0.04; camp 0.04; refugee 0.04; dependent 0.04; combat 0.04; mechanic 0.04; demonstrator 0.04; personnel 0.04; movement 0.04; gunner 0.04; territory 0.04 The number after a word is the similaritybetween the word and #5C#0Cghter&amp;quot;. The intersection of the similar word list and the above table consists of: combat 0.04; reconnaissance 0.05; stealth 0.05; transport 0.05; 4. Find a sense of #5C#0Cghter&amp;quot; in WordNet that is most similar to senses of #5Ccombat&amp;quot;, #5Creconnaissance&amp;quot;, #5Cstealth&amp;quot; or #5Ctransport&amp;quot;. The #5C#0Cghter aircraft&amp;quot; sense of #5C#0Cghter&amp;quot; was selected. We submitted two sets of results in MUC-7: #0F the #5Cnowsd&amp;quot; result in which the senses of a word are chosen simply bychoosing its #0Crst two senses in the WordNet.</Paragraph>
    <Paragraph position="11"> #0F the o#0Ecial result that employs the aboveword sense disambiguation algorithm.</Paragraph>
    <Paragraph position="12"> The results are summarized in Table 6. Although the di#0Berence between the use of WSD and the baseline is quite small, it turns out to be statistically signi#0Ccant. In some of the 20 input texts that were scored in coreference evaluation, the WSD module did not make any di#0Berence. However, whenever there was a di#0Berence it was always an improvement. It is also worth noting that, with WSD, both the recall and precision are increased.</Paragraph>
    <Paragraph position="13">  In hindsight, we probably should have just used the #0Crst sense listed in the WordNet for eachword.</Paragraph>
  </Section>
class="xml-element"></Paper>