<?xml version="1.0" standalone="yes"?>
<Paper uid="P00-1002">
  <Title>Generic NLP Technologies: Language, Knowledge and Information Extraction</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Argument against linguistically elaborate techniques
</SectionTitle>
    <Paragraph position="0"> elaborate techniques Throughout the 80s, research based on linguistics had #0Dourished even in application oriented NLP research such as machine translation. Eurotra, a European MT project, had attracted a large number of theoretical linguists into MT and the linguists developed clean and linguistically elaborate frameworks such as CTA-2, Simple Transfer, Eurotra-6, etc.</Paragraph>
    <Paragraph position="1"> ATR, a Japanese research institute for telephone dialogue translation supported by a consortium of private companies and the Ministry of Post and Communication, also adopted a linguistics-based framework, although they changed their direction in the later stage of the project. They also adopted sophisticated plan-based dialogue models as well at the initial stage of the project.</Paragraph>
    <Paragraph position="2"> However, the trend changed rather drastically in the early 90s and most research groups with practical applications in mind gave up such strategies and switched to more corpus-oriented and statistical methods. Instead of sentential parsing based on linguistically well founded grammar, for example, they started to use simpler but more robust techniques based on #0Cnite-state models. Neither did knowledge-based techniques like plan-recognition, etc. survive, which presume explicit representation of domain knowledge.</Paragraph>
    <Paragraph position="3"> One of the major reasons for the failure of these techniques is that, while these techniques alone cannot solve the whole range of problems that NLP application encounters, both linguistsand AI researchers made strong claims that their techniques would be able to solve most, if not all, of the problems. Although formalisms based on linguistic theories can certainly contribute to the development of clean and modular frameworks for NLP, it is rather obvious that linguistics theories alone cannot solve most of NLP's problems. Most of MT's problems, for example, are related with semantics or interpretation of language which linguistictheories of syntax can hardly o#0Ber solutions for #28Tsujii 1995#29. However, this does not imply, either, that frameworks based on linguistic theories are of no use for MT or NLP application in general.</Paragraph>
    <Paragraph position="4"> This only implies that we need techniques complementary to those based on linguistic theories and that frameworks based on linguistic theories should be augmented or combined with other techniques. Since techniques from complementary #0Celds such as statistical or corpus-based ones have made signi#0Ccant progresses, it is our contention in this paper that we should start to think seriously about combining the fruits of the research results of the 80s with those of the 90s.</Paragraph>
    <Paragraph position="5"> The other claims against linguistics-based and knowledge-based techniques which have often been made by practical-minded people are : #281#29 E#0Eciency: The techniques such as sentential parsing and knowledge-based inference, etc. are slow and require a large amount of memory #282#29 Ambiguity of Parsing: Sentential parsing tends to generate thousands of parse results from which systems cannot choose the correct one.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
(3) Incompleteness of Knowledge and Robustness
</SectionTitle>
      <Paragraph position="0"> Robustness: In practice one cannot provide systems with complete knowledge. Defects in knowledge often cause failures in processing, which result in the fragile behavior of systems.</Paragraph>
      <Paragraph position="1"> While these claims may have been the case during the 80s, the steady progress of such technologies have largely removed these dif#0Cculties. Instead, the disadvantages of current technologies based on #0Cnite state technologies, etc. have increasingly become clearer; the disadvantages such ad-hocness and opaqueness of systems which prevent them from being transferred from an application in one domain to another domain.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The current state of the JSPS project
</SectionTitle>
    <Paragraph position="0"> project Ina#0Cve-year project funded by JSPS #28Japan Society of Promotion of Science#29 which started in September 1996, we have focussed our research on generic techniques that will be used for di#0Berent kinds of NLP application and domains.</Paragraph>
    <Paragraph position="1"> The project comprises three university groups from the University of Tokyo, Tokyo Institute of Technology #28Prof. Tokunaga#29 and Kyoto University #28Dr. Kurohashi#29, and coordinated by myself #28at the University of Tokyo#29. The University of Tokyo has been engaged in development of software infrastructure for e#0Ecient NLP, parsing technology and ontology building from texts, while the groups of Tokyo Institute of Technology and Kyoto University have been responsible for NLP application to IR and Knowledge-based NLP techniques, respectively.</Paragraph>
    <Paragraph position="2"> Since wehave delivered promisingresults in research on generic NLP methods, we are now engaged in developing several application systems that integrate various research results to show their feasibility in actual application environments. One such application is a system that helps biochemists working in the #0Celd of genome research.</Paragraph>
    <Paragraph position="3"> The system integrates various research results of our project such as new techniques for query expansion and intelligent indexing in IR, etc. The two results to be integrated into the system that we focus on in this paper are IE using a full-parser #28sentential parser based on grammar#29 and ontology building from texts.</Paragraph>
    <Paragraph position="4"> IE is very much in demand in genome research, since quite a large portion of research is now being targeted to construct systems that model complete sequences of interaction of various materials in biological organisms. These systems require extraction of relevant information from texts and its integration in #0Cxed formats. This entails that the researchers there should have a model of interaction among materials, into which actual pieces of information extracted from texts are #0Ctted. Such a model should have a set of classes of interaction #28event classes#29 and a set of classes of entities that participate in events. That is, the ontology of the domainshouldexist. However, sincethe buildingof an ultimate ontology is, in a sense, the goal of science, the explicit ontology exists only in a very restricted and partial form. In other words, IE and Ontology building are inevitably intertwined here.</Paragraph>
    <Paragraph position="5"> In short, we found that IE and Ontology building from texts in genome research provide an ideal test bed for our generic NLP techniques, namelysoftware infrastructurefor e#0Ecient NLP, parsing technology, and ontology building from texts with initial partial knowledge of the domain.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Software Infrastructure and Parsing Technology
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Parsing Technology
</SectionTitle>
      <Paragraph position="0"> While tree structures are a versatile scheme for linguistic representation, invention of feature structures that allow complex features and reentrancy #28structure sharing#29 makes linguistic representation concise and allows declarative speci#0Ccations of mutual relationships among representation of di#0Berent linguistic levels #28e.g.: morphology, syntax, semantics, discourse, etc.#29. More importantly, using bundles of features instead of simple non-terminal symbols to characterize linguistic objects allow us to use much richer statistical means such as ME #28maximum entropy model#29, etc. instead of simple probabilistic CFG. However, the potential has hardly been pursued yet mostly due to the ine#0Eciency and fragility of parsing based on feature-based formalisms. null In order to remove the e#0Eciency obstacle, we have in the #0Crst two years devoted ourselves to the developmentof: #28A#29 Software infrastructure that makes processing of feature-based formalisms e#0Ecient enough both for practical application and for combining it with statistical means.</Paragraph>
      <Paragraph position="1"> #28B#29 Grammar #28Japanese and English#29 with wide coverage for processing real world texts #28not examples in textbooks of linguistics#29. At the same time, processing techniques that make a system robust enough for application.</Paragraph>
      <Paragraph position="2"> #28C#29 E#0Ecient parsing algorithm for linguistics-based frameworks, in particular HPSG.</Paragraph>
      <Paragraph position="3"> We describe the current states of these three in the following.</Paragraph>
      <Paragraph position="4"> #28A#29 Software Infrastructure #28Miyao 2000#29: We designed and develop a programming system, LiLFeS, which is an extension of Prolog for expressing typed feature structures instead of #0Crst order terms. The system's core engine is an abstract machine that can process features and execute de#0Cnite clause program. While similar attempts treat feature structure processing separately from that of de#0Cnite clause programs, the LiLFeS abstract machine increases processing speed by seamlessly processing feature structures and de#0Cnite clause programs.</Paragraph>
      <Paragraph position="5"> Diverse systems, such as large scale English and Japanese grammar, a statistical disambiguation module for the Japanese parser, a robust parser for English, etc., have already been developed in the LiLFeS system.</Paragraph>
      <Paragraph position="6"> We compared the performance of the system with other systems, in particular with LKB developed by CSLI, Stanford University,by using the same grammar #28LinGo also provided by Stanford University#29. A parsing system in the LiLFeS system, which adopts a naive CKY algorithm without any sophistication, shows similar performance as that of LKB which uses a more re#0Cned algorithm to #0Clter out unnecessary uni#0Ccation. The detailed examination reveals that feature uni#0Ccation of the LiLFeS system is about four times faster than LKB.</Paragraph>
      <Paragraph position="7"> Furthermore, since LiLFeS has quite a few built-in functions that facilitate fast subsumption checking, e#0Ecient memory management, etc., the performance comparison reveals that more advanced parsing algorithms like the one we developed in #28C#29 can bene#0Ct from the LiLFeS system. Wehave almost #0Cnished the second version of the LiLFeS system that uses a more #0Cne-grained instruction set, directly translatable to naive machine code of aPentium CPU. The new version shows more than twice improvement in execution speed, which means the naive CKY algorithm without any sophistication in the LiLFeS system</Paragraph>
      <Paragraph position="9"> While LinGo that we used for comparison is an interesting grammar from the view point of linguistics, the coverage of the grammar is rather restricted. We have cooperated with the University of Pennsylvania to develop a grammar with wide coverage. In this cooperation, we translated an existing wide-coverage grammar of XTAG to the framework of HPSG, since our parsing algorithms in #28C#29 all assume that the grammar are HPSG. As we discuss in the following section, we will use this translated grammar as the core grammar for information extraction from texts in genome science.</Paragraph>
      <Paragraph position="10"> As for wide-coverage Japanese Grammar, we have developed our own grammar #28SLUNG#29 . SLUNG exploits the property of HPSG that allows under-speci#0Ced constraints. That is, in order to obtain wide-coverage from the very beginning of grammar development, we only give loose constraints to individual words that may over-generate wrong interpretations but nonetheless guarantee correct ones to be always generated.</Paragraph>
      <Paragraph position="11"> Instead of rather rigid and strict constraints, we prepare 76 templates for lexical entries that specify behaviors of words belonging to these 76 classes. The approach is against the spirit of HPSG or lexicalized grammar that emphasizes constraints speci#0Cc to individual lexical items. However, our goal is #0Crst to develop wide-coverage grammar that can be improved by adding lexicalitem speci#0Cc constraints in the later stage of grammar development. The strategy has proved to be e#0Bective and the current grammar can produce successful parse results for 98.3 #25 of sentences in the EDR corpus with high e#0Eciency #280.38 sec per sentence for the EDR corpus#29. Since the grammar overgenerates, wehave to choose single parse results among a combinatorially large numberofpossible parses. However, an experiment shows that a statistic method using ME #28we use the program for ME developed by NYU#29 can select around 88.6 #25 of correct analysis in terms of dependency relationships among ! ! bunsetsu's - the phrases in Japanese#29.</Paragraph>
      <Paragraph position="12"> #28C#29 E#0Ecient parsing algorithm #28Torisawa 2000#29: While feature structure representation provides an e#0Bective means of representing linguistic objects and constraints on them, checking satis#0Cability of constraints by linguistic objects, i.e. uni#0Ccation, is computationally expensive in terms of time and space. Oneway of improvingthe e#0Eciency isto avoid uni#0Ccation operations as much as possible, while the other way is to provide e#0Ecient software infrastructure such as in #28A#29. Once we choose a speci#0Cc task like parsing, generation, etc., we can devise e#0Ecient algorithms for avoiding uni#0Ccation.</Paragraph>
      <Paragraph position="13"> LKB accomplishes such reduction by inspecting dependencies among features, while the algorithm wechose is to reduce necessary uni#0Ccation by compiling given HPSG grammar into CFG. The CFG skeleton of given HPSG, which is semi-automatically extracted from the original HPSG, is applied to produce possible candidates of parse trees in the #0Crst phase. The skeletal parsing based on extracted CFG #0Clters out the local constituent structures which do not contribute to any parse covering the whole sentence. Since a large proportion of local constituent structures do not actually contribute to the whole parse, this #0Crst CFG phase helps the second phase to avoid most of the globally meaningless uni#0Ccation. The e#0Eciency gain by this compilationtechnique dependson the nature of the original grammar to be compiled.</Paragraph>
      <Paragraph position="14"> While the e#0Eciency gain for SLUNG is just two times, the gain for XHPSG #28HPSG grammar obtained by translating the XTAG grammar into HPSG#29 is around 47 times for the ATIS corpus #28Tateisi 1998#29.</Paragraph>
      <Paragraph position="15"> 5 Information extraction by sentential parsing The basic arguments against use of sentential parsing in practical application suchasIEare the ine#0Eciency in terms of time and space, the fragility of systems based on linguistically rigid frameworks and highly ambiguous parse results that we often have as results of parsing. null On the other hand, there are arguments for sentential parsing or the deep analysis approach. One argument is that an approach based on linguistically sound frameworks makes systems transparent and easy to re-use. The other is the limit on the quality that is achievable by the pattern matching approach. While a higher recall rate of IE requires a large amount of patterns to cover diverse surface realization of the same information, wehave to widen linguistic contexts to improve the precision by preventing extraction of false information. A pattern-based system may end up with a set of patterns whose complex mutual nullifythe initial appeal of simplicityofthe pattern-based approach. null As we see in the previous section, the e#0Eciency problem becomes less problematic by utilizing the current parsing technology. It is still a problem when we apply the deep analysis to texts in the #0Celd of genome science, which tend to have much longer sentences than in the ATIS corpus. However, as in the pattern-based approach, we can reduce the complexity of problems by combining different techniques.</Paragraph>
      <Paragraph position="16"> In a preliminary experiment, we #0Crst use a shallow parser #28ENGCG#29 to reduce part-of-speech ambiguities before sentential parsing. Unlike statistic POS taggers, the constraint grammar adopted by ENGCG preserves all possiblePOS interpretations just by dropping interpretationsthat are impossibleingivenlocal contexts. Therefore, the use of ENGCG does not a#0Bect the soundness and completeness of the whole system, while it reduces signi#0Ccantly the local ambiguities that do not contribute to the whole parse.</Paragraph>
      <Paragraph position="17"> The experiment shows that ENGCG prevents 60 #25 of edges produced by a parser Based on naive CKY algorithm, when it is applied to 180 sentences randomly chosen from MEDLINE abstracts #28Yakushiji 2000#29. As a result, the parsing by XHPSG becomes four times faster from 20.0 seconds to 4.8 second per sentence, which is further improved by using chunking based on the output of a Named Entity recognition tool to 2.57 second per sentence. Since the experiment was conducted with a naive parser based on CYK and the old version of LiLFeS, the performance can be improved further.</Paragraph>
      <Paragraph position="18"> The problems of fragility and ambiguity still remain. XHPSG fails to produce parses for about half of the sentences that cover the whole. However, in application such as IE, a system needs not have parses covering the whole sentence. If the part in which the relevant pieces of information appear can be parsed, the system can extract them. This is one of the major reasons why pattern-based systems can work in a robust manner. The same idea can be used in IE based on sentential parser. That is, techniques that can extract information from partial parse results will make the system robust.</Paragraph>
      <Paragraph position="19"> The problem of ambiguity can be treated in a similar manner. In a pattern-based system, the system extracts informationwhen parts of the text match with a pattern, independently of whether other interpretations that compete with the interpretation intended by the pattern exist or not. In this way, a pattern-based system treats ambiguity implicitly. In case of the approach based on sentential parsing, we treat the ambiguity problemby preference.</Paragraph>
      <Paragraph position="20"> That is, an interpretation that indicates relevant pieces of information exist is preferred to other interpretations.</Paragraph>
      <Paragraph position="21"> Although the methods illustrated in the above make IE based on sentential parsing similar to the pattern-based approach, the approach retains the advantages over the pattern-based one. For example, it can prevent false extraction if the pattern that dictates extraction contradicts with wider linguistic structures or with the more preferred interpretations. It keeps separate the general linguistic knowledge embodied in the form of XHPSG grammar that can be used in any domain. The mapping between syntactic structures to predicate structures can also be systematic. null</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Information extraction of named entities using a hidden Markov model
</SectionTitle>
    <Paragraph position="0"> entities using a hidden Markov model The named entity tool mentioned above, called NEHMM #28Collier 2000#29, has been developed as a generalizable supervised learning method for identifying and classifying terms given a training corpus of SGML marked-up texts. HMMs themselves belong to a class of learning algorithms that can be considered to be stochastic #0Cnite state machines. They have enjoyed success in a wide number of #0Celds including speech recognition and part of speech tagging. We therefore consider their extension to the named entity task, which is essentially a kind of semantic tagging of words based on their class, to be quite natural.</Paragraph>
    <Paragraph position="1"> NEHMM itself strives to be highly generalizable to terms in di#0Berent domains and the initial version uses bigrams based on lexical and character features with one state per name class. Data-sparseness is overcome using the character features and linearinterpolation. null Nobata et al. #28Nobata 1999#29 comment on the particular di#0Eculties with identifying and classifying terms in the biochemistry domain including an open vocabulary and irregular naming conventions as well as extensive cross-over invocabulary between classes. The irregularnamingarises inpart because of the number of researchers from di#0Berent #0Celds who are working on the same knowledge discovery area as well as the large number of proteins, DNA etc. that need to be named. Despitethe beste#0Borts of majorjournalsto standardize the terminology, there is also a signi#0Ccant problem with synonymy so that often an entity has more than one name that is widely used such as the protein names AKT and PKB. Class cross-over of terms is another problem that arises because many DNA and RNA are named after the protein with which they transcribe.</Paragraph>
    <Paragraph position="2"> Despite the apparent simplicity of the knowledge in NEHMM, the model has proven to be quite powerful in application. In the genome domain with only 80 training MEDLINE abstracts it could achieve over 74#25 F-score #28a common metric for evaluation used in IE that combines recall and precision#29. Similar performance has been found when training using the dry-run and test set for MUC-6 #2860 articles#29 in the news domain.</Paragraph>
    <Paragraph position="3"> The next stage in the development of our model is to train using larger test sets and to incorporate wider contextual knowledge, perhaps by marking-up for dependencies of named-entities in the training corpus. This extra level of structural knowledge should help to constrain class assignment and also to aid in higher levels of IE suchasevent extraction. null</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Knowledge Building and Text Annotation
</SectionTitle>
    <Paragraph position="0"> Annotated corpora constitute not only an integral part of a linguistic investigation but also an essential part of the design methodology foranNLP systems. Inparticular, the design of IE systems requires clear understanding of information formats of the domain, i.e. what kinds of entities and events are considered as essential ingredients of information.</Paragraph>
    <Paragraph position="1"> However, such information formats are often implicit in the minds of domain specialists and the process of annotating texts helps to reveal them.</Paragraph>
    <Paragraph position="2"> It is also the case that the mapping between information formats and surface linguistic realization is not trivial and that capturing the mapping requires empirical examination of actual corpora. While generic programs with learning ability may learn such a mapping, learning algorithms need training data, i.e. annotated corpora.</Paragraph>
    <Paragraph position="3"> In order to design a NE recognition program, for example, wehavetohave a reasonable amount of annotated texts which show in what linguistic contexts named entities appear and what internal structures typical linguisticexpressionsof namedentitiesofa given #0Celd have. Such human inspection of annotated texts suggests feasible tools for NE #28e.g. HMM, ME, decisiontrees, dictionarylook-up, etc.#29 and a set of feasible features, if one uses programs with learning ability. Human inspection of annotated corpora is still an inevitable step of feature selection, even if one uses programs with learning ability.</Paragraph>
    <Paragraph position="4"> More importantly, to determine classes of named entities and events which should re#0Dect the views of domain specialists requires empirical investigation, since these often exist implicitlyonly inthe mindof specialists. This is particularly the case in the #0Celd of medical and biological sciences, since they have a much larger collection of terms #28i.e. class names#29 than, for example, mathematical science, physics, etc.</Paragraph>
    <Paragraph position="5"> In order to see the magnitude of the work and di#0Eculties involved, we chose a wellcircumscribed#0Celd and collected texts #28MED-LINE abstracts#29 in the #0Celd to be annotated. The #0Celd is the reaction of transcription factors in human blood cells. The kinds of information that we try to extract are the information on protein-protein interactions.</Paragraph>
    <Paragraph position="6"> The #0Celd was chosen because a research group of National Health Research Institute of the Ministry of Health in Japan is building a database called CSNDB #28Cell Signal Network DB#29, which gathers this type of information. They read papers every week to extract relevant information and store them in the database. IE of this #0Celd can reduce the work that is done manually at present.</Paragraph>
    <Paragraph position="7"> We selected abstracts from MEDLINE by the key words of &amp;quot;human&amp;quot;, &amp;quot;transcription factors&amp;quot; and &amp;quot;blood cells&amp;quot;, which yield 3300 abstracts. The abstracts are from 100 to 200 words in length. 500 abstracts were chosen randomly and annotated. Currently, semantic annotation of 300 abstracts has been #0Cnished and we expect 500 abstracts to be done by April #28Ohta 2000#29.</Paragraph>
    <Paragraph position="8"> The task of annotation can be regarded as identifying and classifying the terms that appear in texts according to a pre-de#0Cned classi#0Ccation scheme. The classi#0Ccation scheme, in turn, re#0Dects the view of the #0Celds that biochemists have. That is, semantic tags we use are the class names in an ontology of the #0Celd.</Paragraph>
    <Paragraph position="9"> Ontologies of biological terminology have been created in projects such as the EU funded GALEN project to provide a model of biological concepts that can be used to integrate heterogeneous information sources while some ontologies such as MeSH are built for the purpose of information retrieval According to their purposes, ontologies di#0Ber from #0Cne-grained to coarse ones and from associative to logical ones. Since there is no appropriate ontology that covers the domain that we are interested in, we decided to build one for this speci#0Cc domain.</Paragraph>
    <Paragraph position="10"> The design of our ontology is in progress, in which we distinguish classi#0Ccation based on roles that proteins play in events from that based on internal structures of proteins.</Paragraph>
    <Paragraph position="11"> The former classi#0Ccation is closely linked with classi#0Ccation of events. Since classi#0Ccation is based on feature lattices, we plan to use the LiLFeS system to de#0Cne these classi#0Ccation schemes and their relationships among them.</Paragraph>
  </Section>
class="xml-element"></Paper>