<?xml version="1.0" standalone="yes"?>
<Paper uid="C82-1032">
  <Title>ANALYSIS AND PROCESSING OF COMPACT TEXT</Title>
  <Section position="3" start_page="0" end_page="201" type="metho">
    <SectionTitle>
DATA AND METHODS OF ANALYSIS
</SectionTitle>
    <Paragraph position="0"> The data of this study consist of eight medical discharge summaries, each 1-2 pages in length. These physician reports were dictated and then transcribed by a medical typist into computer-readable form. As a preliminary to machine processing, the documents received minimal pre-editing: necessary formatting changes, such as inserting a blank between a word and its following punctuation mark, and occasional spelling corrections. Abbreviations were maintained as used in the text and were treated as dictionary entries linked to their full forms in the dictionary. A computer dictionary of the document words was obtained by look-up in the medical parsing dictionary developed by the Linguistic String Project (LSP), which was augmented for new words appearing in these texts. The dictionary classifies words according to their major parts of speech (e.g. noun, verb, adjective), as well as certain English subclasses (e.g. plural, past) and special medical subfield classes. The medical classes are based on cooccurrence properties of the words as seen in a larger survey of the material and are checked for semantic consistency by a physician-consultant. In total, the medical classes currently number sixty-six. These are used in selectional constraints applied during parsing to resolve syntactic ambiguities. A smaller set of eighteen medical classes determines the major semantic sentence types discussed below. Descriptions of the 18 medical classes are given in Table 1, drawn from [6], where all the medical classes are defined and details of the text processing procedure are described.</Paragraph>
    <Paragraph position="1"> The input sentences from the discharge summaries are structured by a three-stage system of (1) parsing, (2) syntactic regularization, and (3) mapping into an information format.</Paragraph>
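The pre-editing and dictionary look-up described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the dictionary entries, the class labels other than DIAG, and the function names `pre_edit` and `look_up` are hypothetical and do not reproduce the LSP dictionary or its storage format.

```python
import re

# Hypothetical miniature dictionary (illustrative entries, not LSP data).
# Each entry lists parts of speech, a medical class, and, for
# abbreviations, a link to the full form.
DICTIONARY = {
    "hct": {"pos": ["noun"], "medical": "LAB-TEST", "full_form": "hematocrit"},
    "hematocrit": {"pos": ["noun"], "medical": "LAB-TEST"},
    "normal": {"pos": ["adjective"], "medical": "NORMALCY"},
    "meningitis": {"pos": ["noun"], "medical": "DIAG"},
}


def pre_edit(text):
    """Insert a blank between a word and the punctuation mark that follows it."""
    # The lookahead requires whitespace or end of text after the mark,
    # so decimals such as 38.5 are left intact.
    return re.sub(r"(\w)([.,;:?!])(?=\s|$)", r"\1 \2", text)


def look_up(tokens):
    """Pair each token with its dictionary entry, or None for a new word."""
    return [(token, DICTIONARY.get(token.lower())) for token in tokens]


tokens = pre_edit("Hct normal, temperature 38.5.").split()
for token, entry in look_up(tokens):
    print(token, entry)
```

Tokens that come back paired with `None` correspond to the new words for which, in the study, the dictionary had to be augmented.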
    <Paragraph position="2"> The first step is to parse the document sentences with the Linguistic String Project parser [7]. This step begins with a dictionary lookup to associate the stored lexical information with each word occurrence in the sentence. The parsing utilizes a medically tuned grammar, which includes productions for the compact sentence syntactic types discussed below, and also productions for special sublanguage constructions (e.g. dose expressions, penicillamine 250 MG PO (~D_). The sentence parse identifies grammatical relations which hold among parts of the sentence, principally subject-verb-object relations and host-modifier relations. Also built into the grammar is a selectional mechanism which disambiguates multiply classified words based on the type of subject-verb-object or host-modifier relationships permitted in the sublanguage [8].</Paragraph>
    <Paragraph position="3"> The second step is syntactic regularization. Each sentence undergoes a series of paraphrastic English transformations regularizing the syntax within the sentences in order to reduce the variety of syntactic structures to a set of basic syntactic relations [9]. The syntactic regularization does not alter the information content of the sentences. In addition, reduced word forms, such as abbreviations, are replaced by their full forms.</Paragraph>
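The idea behind selectional disambiguation can be sketched in a few lines of Python. This is a toy illustration, not the LSP mechanism, which operates inside the grammar on parse structures: the class labels, the single permitted pattern, and the function name `select_readings` are all hypothetical.

```python
from itertools import product

# Hypothetical class assignments; "cold" is deliberately given two classes.
WORD_CLASSES = {
    "patient": {"PT"},
    "had": {"V-HAVE"},
    "cold": {"SIGN-SYMPTOM", "DESCR"},
}

# Hypothetical subject-verb-object class patterns permitted in the sublanguage.
PERMITTED_SVO = {("PT", "V-HAVE", "SIGN-SYMPTOM")}


def select_readings(subject, verb, obj):
    """Return the class combinations that match a permitted pattern."""
    candidates = product(
        sorted(WORD_CLASSES[subject]),
        sorted(WORD_CLASSES[verb]),
        sorted(WORD_CLASSES[obj]),
    )
    return [reading for reading in candidates if reading in PERMITTED_SVO]


print(select_readings("patient", "had", "cold"))
```

Only the reading in which "cold" is a sign or symptom survives, because the descriptor reading does not fit any permitted subject-verb-object pattern.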
    <Paragraph position="4"> The final processing step is information formatting, which maps the words of the parsed, transformed sentence into a tabular representation of the information contained in the sentences. A word is mapped into the format column which corresponds to the information content of the word. In general, there is a 1-to-1 correspondence between the sublanguage word class and a particular format column. For example, a word of the medical class DIAG(nosis), e.g. rae~, would be mapped into the DIAG column of the format. Formatting is based on cooccurrence patterns found in the text and on the lexical information obtained from the computer dictionary of medical vocabulary. The sets of filled format columns represent the semantic patterns found within the data. These semantic patterns will be discussed below.</Paragraph>
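The 1-to-1 correspondence between word class and format column can be sketched as follows. This is a simplified illustration, assuming a flat word list rather than a parsed, regularized sentence; apart from DIAG, the class labels are hypothetical, as is the function name `fill_format`.

```python
# Hypothetical word-to-class table; only DIAG is a class name taken from the text.
WORD_CLASS = {
    "meningitis": "DIAG",
    "penicillin": "MED",
    "hematocrit": "LAB-TEST",
}


def fill_format(words):
    """Map each classified word into the format column named after its class."""
    row = {}
    for word in words:
        column = WORD_CLASS.get(word.lower())
        if column is not None:
            # Words of the same class share one column of the format row.
            row.setdefault(column, []).append(word)
    return row


print(fill_format(["meningitis", "treated", "with", "penicillin"]))
```

The set of columns that end up filled for a sentence plays the role of the semantic pattern referred to in the text.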
    <Paragraph position="5"> The data were run on the LSP natural language processing system, as implemented in FORTRAN on a Control Data 6600, requiring about 75,000 words of memory.</Paragraph>
    <Paragraph position="6"> The LSP system also runs on the CDC CYBER, the VAX, and UNIVAC 1100 machines. The English grammar, regularization component, and information formatting component are written in Restriction Language, a special high-level language developed for writing natural language grammars.</Paragraph>
  </Section>
  <Section position="4" start_page="201" end_page="203" type="metho">
    <SectionTitle>
ANALYSIS AND PROCESSING OF COMPACT TEXT
</SectionTitle>
    <Paragraph position="0"> description of body-functions: hearing, movement.</Paragraph>
    <Paragraph position="1"> standard body-measures: ~, temperature, blood pressure. location of test or symptom: hand, ~. test or technique performed during physical exam: percussion, hear (rales).</Paragraph>
    <Paragraph position="2"> patient growth word: ~, ub~, birth.</Paragraph>
    <Paragraph position="3"> neutral descriptor term: ely_~!pw, flat, pale.</Paragraph>
    <Paragraph position="4"> diagnosis word: meningitis, sickle cell disease. institution, clinic, or doctor: emergency room, hematology, local doctor. result of laboratory test or culture, generally contains agent words: pneumococcus type 18, ath~.</Paragraph>
    <Paragraph position="5"> medication or specific treatment: penicillin, ampicillin, transfusion. word indicating normalcy or change towards normalcy: normal, ~, convalesce, improvement. patient numerical quantifier, possibly with unit. non-normal sign or symptom: crisis, cold, headache.</Paragraph>
    <Paragraph position="6"> laboratory test, including x-rays, chemistry, bacteriology, and hematology: x-ray, urinalysis, hematocrit.</Paragraph>
    <Paragraph position="7"> general patient management: admission, follow, ~, decision. patient or symptom response word: respond to, controlled by.</Paragraph>
    <Paragraph position="8"> general treatment verb or noun: treatment, ~.</Paragraph>
    <Section position="1" start_page="203" end_page="203" type="sub_section">
      <SectionTitle>
Medical Word Subclasses
</SectionTitle>
      <Paragraph position="0"> 204 E. MARSH and N. SAGER</Paragraph>
    </Section>
  </Section>
</Paper>