<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1802">
  <Title>Metalinguistic Information Extraction for Terminology</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 The MeSH and SPECIALIST vocabularies, a Metathesaurus, a Semantic Network, etc.
</SectionTitle>
    <Paragraph position="1"> dated fairly quickly, and elucidating this information from domain experts is not an option. Neology detection, terminological information update and other tasks can benefit from automatic search, in highly technical text, of semantic and pragmatic information, e.g. when new information about sublanguage usage is being put forward. In this paper we describe and evaluate the Metalinguistic Operation Processor (MOP) system, implemented to automatically create Metalinguistic Information Databases (or MIDs) from large collections of special-domain research and reference documents. Section 2 discusses previous work, while Section 3 provides an overview of metalinguistic exchanges between experts, and their role in the constitution of technical knowledge. Section 4 presents experiments to localize and disambiguate good candidate metalinguistic sentences, using rule-based and stochastic learning strategies. Section 5 focuses on the problem of identifying and structuring the different linguistic constituents and surface segments of metalinguistic predications.</Paragraph>
    <Paragraph position="2"> Finally, Section 6 offers a discussion of results and suggestions for possible applications and future lines of research.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Previous work
</SectionTitle>
    <Paragraph position="0"> One of the constraints of recent lines of research (Pearson, 1998; Klavans et al., 2001; Pascual &amp; Pery-Woodley, 1997) is their focus on definitions, a theoretical object that, although undoubtedly useful and extensively described, presents by its very nature certain limitations when studying expert-domain peer-to-peer communication.2 The meaning normalization process inherent in 2 In some recent approaches, Meyer (2001) and Condamines &amp; Rebeyrolles (2001) exploit wider lexico-conceptual relations in free-text that can be difficult to model and locate accurately.</Paragraph>
    <Paragraph position="1"> CompuTerm 2004 - 3rd International Workshop on Computational Terminology 15 compiling definitions may be desirable when creating human-readable reference sources, but might lead to a loss of valuable information for specific contexts where the term appears.</Paragraph>
    <Paragraph position="2"> Pragmatic information (valid usage conditions or contextual restriction for the terms), or purely evaluative statements (usefulness or validity of a certain term for its intended purpose), might not be found in classical definitional contexts.</Paragraph>
    <Paragraph position="3"> Metalinguistic information in texts can provide us with information not only about what terms mean, but also how they are actually used by domain experts. A wide spectrum of sentential realizations of these kinds of information has been reported by Meyer (2001) and Rodriguez (2001), and organizing it to provide useful terminological resources is left for manual review by human lexicographers. We believe that using the more general concept of metalanguage can automate as much as possible the extraction of fine-grained knowledge about terms, as well as better capture the dynamical nature of the evolution of the scientific and technical knowledge created through the interaction of expert-domain groups.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Corpora used in our research
</SectionTitle>
      <Paragraph position="0"> Preliminary empirical work to explore how researchers modify the terminological framework of their highly complex conceptual systems included an initial manual review of 19 sociology articles (138k words) in academic journals. We looked at how term introduction and modification was done, as well as how metalinguistic activity was signalled in text, both by lexical and paralinguistic means. Some of the indicators found included verbs and verbal phrases like called, known as, defined as, termed, coined, dubbed, and descriptors such as term and word.</Paragraph>
      <Paragraph position="1"> Non-lexical markers included quotation marks, apposition and text layout.3 The metalinguistic patterns thus identified were expanded (using variations of lexemes, verbal tenses and forms) into 116 queries to the scientific and learned domains of the British National Corpus. The resulting 10,937 sentences (henceforth, the MOP</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Similar work by Pearson (1998) obtained many of
</SectionTitle>
    <Paragraph position="0"> the same patterns from the Nature corpus of exact science documents.</Paragraph>
    <Paragraph position="1"> corpus) were manually classified as metalinguistic or otherwise, with 5,407 (49.6% of total) found to be truly metalinguistic sentences, using the criteria described in Section 3.2 below.4 Other corpora from different domains (described in Section 4) was used both in this preliminary analysis of metalinguistic exchanges, as well as in evaluation and development of the MOP system.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Explicit Metalinguistic Operations
</SectionTitle>
      <Paragraph position="0"> Careful analysis of these corpora, as well of examples in other European languages, presented some interesting facts about what we have termed &amp;quot;Explicit Metalinguistic Operations&amp;quot; (or EMOs):5 A) EMOs do not usually follow the genusdifferentia scheme of aristotelian definitions, nor conform to the rigid and artificial structure of lexicographic entries. More often than not, specific information about language use and term definition is provided by sentences such as (1), in which the term trachea is linked to the description fine hollow tubes in the context of a globally non- null metalinguistic sentence: (1) This means that they ingest oxygen from the  air via fine hollow tubes, known as tracheae.</Paragraph>
      <Paragraph position="1"> In research papers partial and heterogeneous information is much more common than complete definitions, although it might otherwise in textbooks geared towards learning a discipline. B) Introduction of metalinguistic information in discourse is highly regular, regardless of the domain. This can be credited to the fact that the writer needs to mark these sentences for special processing by the reader, as they dissect across two different semiotic levels: a meta-language and its object language, to use the terminology of logic where these concepts originated.6 Their</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Reliability of human subjects for this task has not been reported in the literature, and was not evaluated in our experiments.
</SectionTitle>
    <Paragraph position="1"> 5 We have used the term to highlight the operational nature of such textual instances in technical discourse. 6 Natural language has to be split (at least methodologically) into two distinct systems that share the same rules and elements: a metalanguage used to refer to an object language, which in turn can refer to and describe objects in the mind or in the physical world. The fact that the two are isomorphic accounts for reflexivity, the property of referring to itself, as when linguistic items are mentioned instead of being used normally in an utterance. Rey-Debove (1978) CompuTerm 2004 - 3rd International Workshop on Computational Terminology16 constitutive markedness means that most of the times these sentences will have at least two indicators of metalinguistic nature. These formal and cognitive properties of EMOs facilitate the task of locating them accurately in text.</Paragraph>
    <Paragraph position="2"> C) EMOs can be further analyzed into 3 distinct components, each with its own properties and linguistic realizations: i) An autonym (see note 6): One or more self-referential lexical items that are the logical or grammatical subject of a predication.</Paragraph>
    <Paragraph position="3"> ii) An informational segment: a contribution of relevant information about the meaning, status, coding or interpretation of a linguistic unit. Informational segments constitute what we state about the autonymical element.</Paragraph>
    <Paragraph position="4"> iii) Markers/Operators: Elements used to make prominent the whole discourse operation and its non-referential, metalinguistic nature.</Paragraph>
    <Paragraph position="5"> They are usually lexical, paralinguistic or pragmatic devices that articulate autonyms and informational segments into a predication.</Paragraph>
    <Paragraph position="6"> In a sentence such as (2) we have marked the autonym with italics, the informational segment with bold type and the marker-operator items with square brackets: (2) The bit sequences representing quanta of knowledge [ will be called &amp;quot; ] Kenes [ &amp;quot; ], a neologism intentionally similar to 'genes' .</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Knowledge and knowledge of language
</SectionTitle>
      <Paragraph position="0"> Whenever scientists advance the state of the art of a discipline, their language has to evolve and change, and this build-up is carried out under metalinguistic control. Previous knowledge is transformed into new scientific common ground and ontological commitments are introduced when semantic reference is established. That is why when we want to structure and acquire new knowledge we have to go through a resourcecostly cognitive process that integrates within coherent conceptual structures and theories a considerable amount of new and very complex lexical items and terms. Technical terms are not, by definition, part of the far larger linguistic competence of a first native language. Unlike everyday words within a specific social group, follows Carnap in calling this condition autonymy.</Paragraph>
      <Paragraph position="1"> terms are conventional, even if they have derived from a word that originally belonged to collective competence. We could even posit that all technical terms owe their existence to a baptismal speech act, and that given a big enough sample (an impossibly exhaustive corpus of all expert language exchanges), an initial metalinguistic sentence could be located that constitutes an original, foundational source of meaning.</Paragraph>
      <Paragraph position="2"> The information provided by metalinguistic exchanges is not usually inferable from previous one available to the speaker's community, and does not depend on general language competence by itself, but nevertheless is judged important and relevant enough to warrant the additional processing effort involved. Computing what is relevant metalinguistic information has to be done dynamically by figuring out which terminological items can be assumed to be shared by all, and which are new or have to be modified. It's an extended and more complex instance of lexical alignment between interlocutors (Pickering &amp; Garrod, in press). Observing closely how this alignment is achieved can allow us to create computer applications that mimic some aspects of our impressive human competence as efficient readers of technical subjects, as incredibly good lexical-data processors that constantly update and construct our own special purpose vocabularies.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Filtering out non-metalinguistic sentences: two NLP approaches
</SectionTitle>
    <Paragraph position="0"> two NLP approaches The first issue to tackle when mining metalanguage is how to obtain a reliable set of candidate sentences for input into the next extraction phases. We employ a &amp;quot;discourseoriented&amp;quot; approach that differs from Meyer's (2001) &amp;quot;term-oriented&amp;quot; one. We do not assume we have initially identified a terminological unit and proceed from there, but rather we first locate a metalinguistic discourse operation where a term can be retrieved along with information that refers to it. Condamines &amp; Rebeyrolles (2001) and Meyer (2001) both exploit patterns of &amp;quot;knowledge-rich contexts&amp;quot; to obtain semantic and conceptual information about terms, either to inform terminological definitions or provide structure for a terminological system. A key problem in such approaches that use lexical-based &amp;quot;triggers&amp;quot; is how to control the amount of &amp;quot;noise&amp;quot;, or non-relevant instances. The experiments in this CompuTerm 2004 - 3rd International Workshop on Computational Terminology 17 section compare two different NLP techniques for this task: symbolic and statistic techniques.</Paragraph>
    <Paragraph position="1"> From our initial analysis of various corpora we selected 44 patterns that showed the best statistical reliability as EMO indicators.7 We started out by tokenizing text, which then was run through a cascade of finite-state devices that extracted a set of candidate sentences before filtering out non-metalinguistic instances. Our filtering distinguishes between useful results, e.g.</Paragraph>
    <Paragraph position="2"> using the lexical pattern called in (3) from non- null metalinguistic instances in (4): (3) Since the shame that was elicited by the coding procedure was seldom explicitly mentioned by the patient or the therapist, Lewis called it unacknowledged shame.</Paragraph>
    <Paragraph position="3"> (4) It was Lewis (1971;1976) who called attention  to emotional elements in what until then had been construed as a perceptual phenomenon .</Paragraph>
    <Paragraph position="4"> We experimented with two strategies for disambiguation: first, we used collocations as added restrictions (e.g., verbal vs. nominal occurrences of our lexical markers) to discard non-metalinguistic instances, for example attention in sentence (4) next to the marker called. The next table shows a sample of the filtering collocations.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Sample filtering collocations (Preceding / Subsequent)
</SectionTitle>
      <Paragraph position="0"> For calls - Preceding: in, duty, personal, conference, local, next, the, their, house, anonymous, phone, telephone... Subsequent: out, someone, charges, before, charge, back, contact, for, upon, to, into, off, 911, by...</Paragraph>
      <Paragraph position="1"> For coin - Preceding: pound, small, pence, in, toss, the, this, a, that, one, gold, silver, metal, esophageal... Subsequent: toss.</Paragraph>
      <Paragraph position="2"> We also implemented learning algorithms trained on a subset of our EMO corpus, using as vectors either Part-of-Speech tags or word strings, at one, two, and three positions adjacent before and after our lexical markers. Our evaluations are based on three document sets: a) our original exploratory sociology corpus [5,581 sentences, 243 EMOs]; b) an online histology textbook [5,146 sentences, 69 EMOs]; and c) a small sample from the MedLine abstract database [1,403 sentences, 10 EMOs]. Our system is coded in Python, using the NLTK platform (nltk.sf.net) and a Brill tagger by Hugo Liu at MIT.</Paragraph>
      <Paragraph position="3"> 7 We excluded dispositional and typographical clues from our selectional patterns, involving mainly lexica and punctuation.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 The collocation-based approach
</SectionTitle>
      <Paragraph position="0"> Our first approach fared well, with good precision numbers but not so encouraging recall.</Paragraph>
      <Paragraph position="1"> The sociology corpus gave 0.94 Precision (P) and  0.68 Recall (R), while the histology one presented 0.9 P and 0.5 R. These low recall numbers reflect the fact that we used a non-exhaustive list of metalinguistic patterns. Example (5) shows one kind of metalinguistic sentence attested in corpora that the system does not extract or process: (5) &amp;quot;Intercursive&amp;quot; power, on the other hand, is  power in Weber's sense of constraint by an actor or group of actors over others.</Paragraph>
      <Paragraph position="2"> We also tested extraction against a golden standard where sentences that had patterns that our list was not designed to retrieve were removed, which gave a more realistic picture of how the extraction system worked for the actual dataset it was designed to consider. For the sociology corpus (and a ss factor of 1), P was 0.97 and R 0.79, with an F-measure of 0.87. In the histology one P was measured at 0.94, R at 0.81 and F-measure at 0.87. In order to better compare the two filtering strategies, we decided also to zoom in on a more limited subset of verb forms (namely, calls, called, call), which presented ratios of metalinguistic relevance in our MOP corpus ranging from 100% positives (for the pattern so called + quotation marks) to 31% (call). Restricted to these verbs, our metrics showed precision and recall rates around 0.97. One problem with this approach is that the hand-coded rules are domain-specific, and customization for other domains is labour-intensive. In our tests, although most of the collocations work languagewide (phrasal verbs or prepositions), some of them are very specific.8 Although collocation-based filtering will result in a working system, customization is error-prone and laborious.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Testing learning algorithms
</SectionTitle>
      <Paragraph position="0"> We selected the co-text of marker/operators as relevant features for classifiers based on well-known naive Bayes and Maximum Entropy algorithms that have been reported to work well 8 &amp;quot;esophageal coins&amp;quot; is quite unusual outside of medical documents.</Paragraph>
      <Paragraph position="1"> CompuTerm 2004 - 3rd International Workshop on Computational Terminology18 with sparse data.9 We used either as grammatical context the POS tags or the word forms immediately adjacent in one to three positions before and after our triggering markers. Testing all possible combinations evaluates empirically the ideal mix of algorithm, feature type and coverage that insures best accuracy. The naive Bayes algorithm estimates the conditional probability of a set of features given a label, using the product of the probabilities of the individual features given that label. It assumes that the feature distributions are independent, but it has been shown to work well in cases with high degree of feature dependencies. The Maximum Entropy model establishes a probability distribution favouring entropy or uniformity subject to the constraints encoded in the featureknown label correlation. To train our classifiers, Generalized and Improved Iterative Scaling algorithms were used to estimate the optimal maximum entropy of a feature set, given a corpus.10 1,371 training sentences from our MOP dataset were converted into YES-NO labelled vectors. The following example from the textual segment &amp;quot;... creates what Croft calls a description constraint ...&amp;quot;, uses 3 positions and POS tags: ('VB WP NNP', 'calls', 'DT NN NN')/'YES'@[102].</Paragraph>
      <Paragraph position="2"> The different number of positions to the left and right of our training sentences, as well as the nature of the features selected (there are many more word-types than POS tags) ensured that our 3-part vector introduced a wide range of features against our 2 possible labels. The best results of each algorithms restricted to the lexeme call, are presented in the next table. Figures 1 and 2 present best results in the learning experiments for the complete set of patterns used in the collocation approach, over two of our evaluation corpora.11  9 see Rish, 2001, Ratnaparkhi, 1997 and Berger et al, 1996 for a formal description of these algorithms.</Paragraph>
      <Paragraph position="3"> 10 In other words, given known data statistics, construct a model that best represents them but is otherwise as uniform as possible.</Paragraph>
      <Paragraph position="4"> 11 Legend: P: Precision; R: Recall; F: F-Measure. NB: naive</Paragraph>
      <Paragraph position="6"/>
      <Paragraph position="8"> Although our tests using collocations showed that structural regularities would perform well, our intuitions about improvement using more features (more positions to the right or left of the lexical markers) or a more grammatically restricted environment (surrounding POS tags), turned out to be overly optimistic. Nevertheless, stochastic approaches that used short-range features did perform in line with the hand-coded approach. Both Knowledge-Engineering and supervised learning approaches were adequate for initial filtering of metalinguistic sentences, although learning algorithms might allow easier transport of systems into new domains.</Paragraph>
      <Paragraph position="9"> 5 From EMOs to metalinguistic databases After EMOs were obtained, POS tagging, shallow parsing and limited PP-attachment are performed. Resulting chunks were tagged as Autonyms, Agents, Markers, Anaphoric elements or Noun Chunks, using heuristics based on syntactic, pragmatic and argument structure of lexica in the extraction patterns, as well as on FrameNet data in Name conferral and Name CompuTerm 2004 - 3rd International Workshop on Computational Terminology 19 bearing frames. Next, a predicate processing phase selected the most likely surface realization for informational segments, autonyms and makers-operators, and proceeded to fill out the templates of the database. As mentioned earlier, informational segments present many realizations far from the completeness and conciseness of lexicographic entries. In fact, they may show up as full-fledged clauses (6), as inter- or intra-sentential anaphoric elements (7 and 8), as sortal information (9), or as an unexpressed &amp;quot;existential variable&amp;quot; (logical form [?]x) indicating only that certain discourse entity is being introduced (10):  (6) In 1965 the term soliton was coined to describe waves with this remarkable behaviour.</Paragraph>
      <Paragraph position="10"> (7) This leap brings cultural citizenship in line with what has been called the politics of citizenship .</Paragraph>
      <Paragraph position="11"> (8) They are called &amp;quot;endothermic compounds.&amp;quot; (9) One of the most enduring aspects of all social theories are those conceptual entities known as structures or groups.</Paragraph>
      <Paragraph position="12"> (10) A [$x] so called cell-type-specific TF can be  used by closely related cells....</Paragraph>
      <Paragraph position="13"> We have not included an anaphora-resolution module in our system, so that examples 7, 8 and 10 only output either unresolved surface elements or variable placeholders.12 Nevertheless, more common occurrences like example sentence (1) 12 For sentence (8) the system might retrieve useful information from a previous one: &amp;quot;A few have positive enthalpies of formation.&amp;quot; are enough to create MIDs that constitute useful resources for lexicographers. The correct database entry for (1) is presented below.</Paragraph>
      <Paragraph position="14"> Reference Histology sample # 6 Autonym tracheae Information fine hollow tubes Markers/Operators known as To better reflect overall performance, we introduced a threshold of similarity of 65% for comparison between a golden standard slot entry and the one obtained by the application.13 The final processing stage presented metrics shown in Figure 4. Our best numbers for informational segments ranged around 0.85, while the lowest were obtained for the histology corpus, with global precision and recall rates around 0.71, but with high numbers in the autonym identification task (0.91) and midrange ones for the informational segments (0.8). We observed that even though it is assumed that Bio-Medical Sciences have more consolidated vocabularies than Social Sciences, results for the MedLine and histology corpus occupy the extremes in the spectrum, with the sociology one in the middle range. The total number of candidate sentences was not a good predictor of system performance. The DEFINDER system (Klavans et al., 2001) is to my knowledge the only one fully comparable with MOP, both in scope and goals, but with some significant differences.14 Taking into account 13 Thus, if the autonym or the informational segment is at least 2/3 of the correct response, it is counted as a positive, allowing for expected errors in the PP or acronym attachment algorithms.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
14 DEFINDER examines user-oriented documents
</SectionTitle>
    <Paragraph position="0"> those differences, MOP compares well with the 0.8 precision and 0.75 recall of DEFINDER.</Paragraph>
    <Paragraph position="1"> While the resulting MOP &amp;quot;definitions&amp;quot; generally do not present high readability or completeness, these informational segments are not meant to be read by laymen, but used by domain lexicographers updating existing glossaries for neological change, or, in machine-readable form, by other applications.</Paragraph>
  </Section>
class="xml-element"></Paper>