<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1060">
  <Title>The COMLEX Syntax Project</Title>
  <Section position="2" start_page="0" end_page="300" type="abstr">
    <SectionTitle>
1. Some COMLEX History
</SectionTitle>
    <Paragraph position="0"> There is a long history of trying to design shareable or &quot;polytheoretic&quot; lexicons and interchange formats for lexicons. There has also been substantial work on adapting machine-readable versions of conventional dictionaries for automated language analysis using a number of systems. It is not our intent to review this work here, but only to indicate how our particular project -- COMLEX Syntax -- got started.</Paragraph>
    <Paragraph position="1"> The initial impetus was provided by Charles Wayne, the DARPA/SISTO program manager, in discussions at a meeting held at New Mexico State University in January 1992 to inaugurate the Consortium for Lexical Research. These discussions were further developed at a session at the February 1992 DARPA Speech and Natural Language Workshop at Arden House; a number of proposals were offered there for both interchange standards and shareable dictionaries and grammars. At a subsequent DARPA meeting in July 1992 these ideas crystallized into a proposal by James Pustejovsky and Ralph Grishman to the Linguistic Data Consortium to fund a COMLEX effort.</Paragraph>
    <Paragraph position="2"> Starting from this general proposal, a detailed and formal specification of the syntactic features to be encoded in the lexicon was developed at New York University in the fall of 1992. These specifications were presented at several meetings, at NYU, at the Univ. of Pennsylvania, and at New Mexico State University, and form the basis for the project described here.</Paragraph>
    <Paragraph position="3"> 2. Structure of the Entries  Each entry is organized as a nested set of feature-value lists, using a Lisp-style notation. Each list consists of a type symbol followed by zero or more keyword-value pairs. Each value may in turn be an atom, a string, a list of strings, a feature-value list, or a list of feature-value lists. This is similar in appearance to the typed feature structures which have been used in some other computer lexicons, although we have not yet made any significant use of the inheritance potential of these structures.</Paragraph>
    <Paragraph position="4"> Sample dictionary entries are shown in Figure 1. The first symbol gives the part of speech; a word with several parts of speech will have several dictionary entries, one for each part of speech. Each entry has an :orth feature, giving the base form of the word. Nouns, verbs, and adjectives with irregular morphology will have features for the irregular forms :plural, :past, :pastpart, etc. Words which take complements will have a subcategorization (:subc) feature. For example, the verb &quot;abandon&quot; can occur with a noun phrase followed by a prepositional phrase with the preposition &quot;to&quot; (e.g., &quot;I abandoned him to the linguists.&quot;) or with just a noun phrase complement (&quot;I abandoned the ship.&quot;). Other syntactic features are recorded under :features. For example, the noun &quot;abandon&quot; is marked as (countable :pval (&quot;with&quot;)), indicating that it must appear in the singular with a determiner unless it is preceded by the preposition &quot;with&quot;.</Paragraph>
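The entry structure described above (a type symbol followed by keyword-value pairs, with values that may themselves be lists) can be read with a very small parser. The following Python sketch is only an illustration and is not part of the COMLEX software. The sample entries in the demo are reconstructed from the prose description of "abandon", not copied from Figure 1, and the frame names np-pp and np, the Atom class, and the function names are all assumptions made for this example.

```python
import re

# Tokens: an opening or closing parenthesis, a double-quoted string,
# or a bare atom / keyword such as noun or :orth.
TOKEN_RE = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


class Atom(str):
    """A bare symbol (e.g. verb, :subc), kept distinct from quoted strings."""


def tokenize(text):
    """Split a Lisp-style entry into a flat list of tokens."""
    return TOKEN_RE.findall(text)


def parse(tokens):
    """Parse one value from the token list: an atom, a string, or a list."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # discard the closing parenthesis
        return items
    if token.startswith('"'):
        return token[1:-1]  # quoted string, quotes stripped
    return Atom(token)


def to_record(value):
    """Convert a list headed by a type symbol into a dict.

    Assumption of this sketch: any list whose first element is a bare,
    non-keyword atom is a feature-value list. Other lists (lists of
    strings, lists of feature-value lists) stay Python lists.
    """
    if isinstance(value, list):
        head = value[0] if value else None
        if isinstance(head, Atom) and not head.startswith(":"):
            record = {"type": str(head)}
            rest = value[1:]
            for key, val in zip(rest[0::2], rest[1::2]):
                record[str(key)] = to_record(val)
            return record
        return [to_record(item) for item in value]
    return value


def read_entry(text):
    """Parse the text of a single entry into nested dicts and lists."""
    return to_record(parse(tokenize(text)))


if __name__ == "__main__":
    # Hypothetical entries reconstructed from the description in the text.
    print(read_entry('(verb :orth "abandon" :subc ((np-pp :pval ("to")) (np)))'))
    print(read_entry('(noun :orth "abandon" :features ((countable :pval ("with"))))'))
```

Keeping atoms and quoted strings as separate types is what lets the sketch tell a list of strings such as ("with") apart from a feature-value list such as (np). A reader for the real lexicon would need to follow the actual published notation.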
    <Paragraph position="5"> Other formats have been suggested for dictionary sharing, notably those developed under the Text Encoding Initiative using SGML (Standard Generalized Markup Language). We do not expect that it would be difficult to map the completed lexicon into one of these formats if desired. In addition, some dictionary standards require an entry for each inflected form, whereas COMLEX will have an entry for each base form (lemma). COMLEX has taken this approach in order to avoid having duplicate and possibly inconsistent information for different inflected forms (e.g., for subcategorization). It is straightforward, however, to &quot;expand&quot; the dictionary to have one entry for each inflected form.</Paragraph>
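As a rough illustration of the expansion mentioned above, the Python sketch below turns one base-form verb entry into one entry per inflected form, copying the subcategorization information to each. It is a sketch under stated assumptions, not the project's procedure: the entry is a plain dict, the regular-morphology rules are deliberately naive, and the :lemma and :form keys and both function names are hypothetical. Only :orth, :subc, :past, and :pastpart are feature names taken from the text.

```python
def regular_verb_forms(base):
    """Naive regular English verb morphology (illustrative assumption only).

    Handles just two cases: bases ending in "e" and everything else.
    Consonant doubling, y/ie alternations, etc. are ignored.
    """
    if base.endswith("e"):
        return {
            "3sg": base + "s",
            "past": base + "d",
            "pastpart": base + "d",
            "prespart": base[:-1] + "ing",
        }
    return {
        "3sg": base + "s",
        "past": base + "ed",
        "pastpart": base + "ed",
        "prespart": base + "ing",
    }


def expand_verb(entry):
    """Return one entry per inflected form of a base-form verb entry.

    Irregular forms stored on the entry (:past, :pastpart) override the
    regular rule. Every expanded entry shares the lemma's :subc value,
    so the subcategorization is stated once and copied mechanically.
    """
    base = entry[":orth"]
    forms = {"base": base}
    forms.update(regular_verb_forms(base))
    for name in ("past", "pastpart"):
        key = ":" + name
        if key in entry:
            forms[name] = entry[key]

    expanded = []
    for name, spelling in forms.items():
        expanded.append({
            "type": entry["type"],
            ":orth": spelling,
            ":lemma": base,   # hypothetical key linking back to the base form
            ":form": name,    # hypothetical key naming the inflection
            ":subc": entry.get(":subc", []),
        })
    return expanded


if __name__ == "__main__":
    abandon = {"type": "verb", ":orth": "abandon",
               ":subc": [{"type": "np-pp", ":pval": ["to"]}, {"type": "np"}]}
    for item in expand_verb(abandon):
        print(item[":form"], item[":orth"])
```

Because the expanded entries are derived from the single lemma entry, they cannot drift apart, which matches the stated reason for storing one entry per base form.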
    <Paragraph position="6"> In addition to the information shown, each entry will have revision control information: information on by whom and when it was created, and by whom and when it was revised.</Paragraph>
    <Paragraph position="7"> We also intend to include frequency information, initially just at the part-of-speech level, but eventually at the subcategorization frame level as well.</Paragraph>
  </Section>
</Paper>