<?xml version="1.0" standalone="yes"?> <Paper uid="I05-6004"> <Title>Integration of a Lexical Type Database with a Linguistically Interpreted Corpus</Title> <Section position="3" start_page="32" end_page="37" type="intro"> <SectionTitle> 3 Architecture of the Database </SectionTitle> <Paragraph position="0"> This section details the content of the database and the method of its construction. The database itself is on-line at http://pc1.ku-ntt-unet.ocn.ne.jp/tomcat/lextypeDB/.</Paragraph> <Section position="1" start_page="32" end_page="34" type="sub_section"> <SectionTitle> 3.1 Content of the Database </SectionTitle> <Paragraph position="0"> First of all, what information should be included in such a database to help treebank annotators and grammar developers work consistently? Obviously, once we construct an electronic lexicon, whatever information it includes, we can easily see what lexical types are assumed in the grammar and treebank. But we have to consider carefully what to include in the database to make it clear how each of the lexical types is used and distinguished.</Paragraph> <Paragraph position="1"> We include five kinds of information:

(3) Contents of the Database
  a. Linguistic discussion
     i.   Name
     ii.  Definition
     iii. Criteria for judging whether a word belongs to a given lexical type
     iv.  References to relevant literature
  b. Exemplification
     i.   Words that appear in a treebank
     ii.  Sentences in a treebank that contain the words
  c. Implementation
     i.   The portion of the grammar source file that corresponds to the usage
     ii.  Comments related to that portion
     iii. TODOs
  d. Links to &quot;confusing&quot; lexical types
  e. Links to other dictionaries

That is, we describe each lexical type in depth (3a-3c) and present users (treebank annotators and grammar developers) with explicit links to other lexical types that share homonymous words (3d) (e.g. adv-p-lex-1 vs. ga-wo-ni-case-p-lex in (1)), making it clear what distinguishes them. Further, we present correspondences to other computational dictionaries (3e).</Paragraph> <Paragraph position="2"> Linguistic discussion
To understand lexical types precisely, linguistic observations and analyses are a basic source of information. Firstly, the requirements for naming lexical types in a computational system (3ai) are that the names be short (so that they can be displayed in large trees) and easily distinguishable. Type names are not necessarily understandable to anyone but the developers, so it is useful to link them to more conventional names. For example, ga-wo-ni-p-lex is a Case Particle.</Paragraph> <Paragraph position="3"> Next, the definition field (3aii) contains a widely accepted definition of the lexical type. For example, ga-wo-ni-p-lex (1b) can be defined as &quot;a particle that indicates that the noun it attaches to functions as an argument of a predicate.&quot; Users can grasp the main characteristics of the type from this.</Paragraph> <Paragraph position="4"> Thirdly, the criteria field (3aiii) provides users with a means of investigating whether a given word belongs to the class. That is, it provides positive and negative usage examples. With such usage examples, developers can easily see the differences among lexical types. For example, adv-p-lex-1 (1a) subcategorizes for nouns, while adv-p-lex-6 (2b) subcategorizes for adjectives. Sentences like (1a) and (2b) that fit such criteria should also be treebanked so that they can be used to test that the grammar covers what it claims. This is especially important for regression testing after new development.</Paragraph>
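<Paragraph> As a minimal sketch of how a record with the fields in (3) might be represented, the following Python data class groups the five kinds of information. The field names and types here are illustrative assumptions made for this sketch and do not reflect the actual schema of the on-line database.

# Illustrative sketch only: field names are hypothetical, not the actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class LexicalTypeRecord:
    # (3a) Linguistic discussion
    name: str                       # e.g. "ga-wo-ni-p-lex"
    conventional_name: str          # e.g. "Case Particle"
    definition: str                 # widely accepted definition statement
    criteria: List[str] = field(default_factory=list)    # positive/negative usage examples
    references: List[str] = field(default_factory=list)  # relevant literature
    # (3b) Exemplification, extracted from the treebank
    example_words: List[str] = field(default_factory=list)
    example_sentence_ids: List[int] = field(default_factory=list)
    # (3c) Implementation
    tdl_definition: str = ""        # portion of the grammar source (TDL)
    comments: str = ""
    todos: List[str] = field(default_factory=list)
    # (3d) Links to "confusing" lexical types (compiled dynamically per query word)
    confusing_types: List[str] = field(default_factory=list)
    # (3e) Links to other dictionaries, e.g. {"ChaSen": "...", "EDICT": "..."}
    other_dictionary_links: dict = field(default_factory=dict)
</Paragraph>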
<Paragraph position="5"> Finally, the reference field (3aiv) points to representative papers or books dealing with the lexical type. This allows grammar developers to check quickly against existing analyses, and it also allows users to find more information.
Exemplification
Examples help users understand lexical types concretely. As we have constructed a treebank that is annotated with linguistic information, we can automatically extract relevant examples exhaustively. We give the database two kinds of examples: words, which are instances of the lexical types (3bi), and sentences, which are treebanked examples that contain those words (3bii). These links to the linguistically annotated corpus examples help treebankers check for consistency and help grammar developers check that the lexical types are grounded in the corpus data.
Implementation
Grammar developers need to know the actual implementation of lexical types (3ci). Comments about the implementation (3cii) are also helpful for ascertaining its current status. Although this information is necessarily framework-dependent, all project groups that are constructing detailed linguistic treebanks need to document it. We take our examples from JACY (Siegel and Bender, 2002), a large grammar of Japanese built in the HPSG framework. As actual implementations are generally incomplete, we use this resource to store notes about what remains to be done: TODOs (3ciii) should be explicitly stated to inform grammar developers of what they have to do next. We currently show the actual TDL definition, its parent type or types, the category of the head (SYNSEM.LOCAL.CAT.HEAD), the valency (SYNSEM.LOCAL.CAT.VAL), and the semantic type (SYNSEM.LOCAL.CONT).</Paragraph> <Paragraph position="6"> Links to &quot;confusing&quot; lexical types
For users to distinguish phonologically identical but syntactically or semantically distinct words, it is important to link confusing lexical types to one another within the database. For example, the four lexical types in (1) and (2) are connected with each other in terms of ni. That way, users can compare those words in detail and make a reliable decision when trying to disambiguate usage examples.[6]
[6] Note that this information is not explicitly stored in the database. Rather, it is dynamically compiled from the database together with the lexicon database, one of the component databases explained below, when triggered by a user query. User queries are words like ni.</Paragraph> <Paragraph position="7"> Links to other dictionaries
This information helps us to compare our grammar's treatment with that of other dictionaries. Such comparison facilitates both the understanding of lexical types and the extension of the lexicon. We currently link lexical types of our grammar to those of ChaSen (Matsumoto et al., 2000). [Figure: the page of the lexical type database that describes the lexical type ga-wo-ni-p-lex.]</Paragraph>
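<Paragraph> The &quot;confusing&quot; links in (3d) are compiled on the fly (see footnote 6): given a query word, all lexicon entries sharing its orthography are grouped and their lexical types linked to one another. The Python sketch below illustrates this over a toy in-memory lexicon; the entries are invented for illustration, whereas the actual system consults the lexicon database described in Section 3.2.

# Minimal sketch: the lexicon is modelled as (orthography, lexical type) pairs.
# Real entries live in the lexicon database; the ones below are invented examples.
from collections import defaultdict

LEXICON = [
    ("ni", "adv-p-lex-1"),     # adverbial particle reading
    ("ni", "ga-wo-ni-p-lex"),  # case particle reading
    ("ni", "adv-p-lex-6"),
    ("ga", "ga-wo-ni-p-lex"),
]

def confusing_types(query_word):
    """Return all lexical types whose entries share the query's orthography."""
    by_orth = defaultdict(set)
    for orth, lextype in LEXICON:
        by_orth[orth].add(lextype)
    return sorted(by_orth.get(query_word, set()))

if __name__ == "__main__":
    # A query word such as "ni" links its homonymous lexical types to each other.
    print(confusing_types("ni"))
</Paragraph>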
</Section> <Section position="2" start_page="34" end_page="37" type="sub_section"> <SectionTitle> 3.2 Method of Database Construction </SectionTitle> <Paragraph position="0"> The next question is how to construct such a database. Needless to say, fully manual construction of the database is not realistic, since there are about 300 lexical types and more than 30,000 words in our grammar. In addition, we assume that we will refer to the database each time we annotate parser outputs to build the treebank, and that we will develop the grammar based on the treebanking results. Thus the database construction process must be quick enough not to delay the treebanking and grammar development cycles.</Paragraph> <Paragraph position="1"> To meet these requirements, our method of constructing the lexical type database is semi-automatic; most of the database content is constructed automatically, while the rest must be entered manually. This is depicted in Figure 3.</Paragraph> <Paragraph position="2">
* Content that is constructed automatically
  - Lexical Type ID (Grammar DB)
  - Exemplification (3b) (Treebank DB)
  - Implementation (3ci, ii) (Grammar DB)
  - Links to &quot;confusing&quot; lexical types (3d) (Lexicon DB)
  - Links to other lexicons (3e) (OtherLex DB)
* Content that is constructed manually
  - Linguistic discussion (3a)
  - TODOs (3ciii)

To understand the construction process, a description of the four databases that feed the lexical type database is in order. These are the grammar database, the treebank database, the lexicon database, and the OtherLex database.</Paragraph> <Paragraph position="3"> * The grammar database contains the actual implementation of the grammar, written as typed feature structures using TDL (Krieger and Schafer, 1994). Although it contains the whole implementation (lexical types, phrasal types, types for principles, and so on), only the lexical types are relevant to our task.</Paragraph> <Paragraph position="4"> * The lexicon database gives us mappings between words in the grammar, their orthography, and their lexical types. Thus we can see what words belong to a given lexical type.</Paragraph> <Paragraph position="5"> The data could be stored as TDL, but we use the PostgreSQL LexDB (Copestake et al., 2004), which simplifies access.</Paragraph> <Paragraph position="6"> * The treebank database stores all treebank information, including syntactic derivations, words, and the lexical type for each word.</Paragraph> <Paragraph position="7"> The main treebank is stored as structured text using [incr tsdb()] (Oepen et al., 2002). We have also exported the derivation trees for the treebanked sentences into an SQL database for easy access. The leaves of the parse data consist of words and their lexicon IDs, stored with the ID of the sentence in which each word appears.</Paragraph> <Paragraph position="8"> * We also use databases from other sources, such as ChaSen, Juman, and EDICT.</Paragraph> <Paragraph position="9"> Next we move on to describe the automatic construction. Firstly, we collect all the lexical types assumed in the grammar and treebank from the grammar database. Each type constitutes the ID of a record in the lexical type database.</Paragraph> <Paragraph position="10"> Secondly, we extract words that are judged to belong to a given lexical type, and sentences that contain those words (3b), from the treebank database compiled from the Hinoki treebank (Bond et al., 2004a). The parsed sentences can be seen in various forms: plain text, phrase structure trees, derivation trees, and minimal recursion semantics representations. We use components from the Heart-of-Gold middleware to present these as HTML (Callmeier et al., 2004). Thirdly, implementation information, except for TODOs, is extracted from the grammar database (3ci, ii).</Paragraph>
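<Paragraph> The following sketch illustrates the kind of lookup used to gather the exemplification data in (3b) from the SQL export of the treebank database: given a lexical type, it returns the words annotated with that type together with the IDs of the sentences containing them. The table and column names are hypothetical assumptions for this sketch; the actual export from [incr tsdb()] will differ.

# Sketch of example extraction; the schema (table/column names) is hypothetical.
import sqlite3

def examples_for_lexical_type(db_path, lexical_type):
    """Return (word, sentence_id) pairs for all leaves annotated with the type."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            """
            SELECT leaf.word, leaf.sentence_id
            FROM   leaf
            JOIN   lexicon ON lexicon.lex_id = leaf.lex_id
            WHERE  lexicon.lexical_type = ?
            ORDER BY leaf.sentence_id
            """,
            (lexical_type,),
        ).fetchall()
    finally:
        con.close()
    return rows

# Usage (hypothetical database file): words and treebanked sentences for ga-wo-ni-p-lex.
# examples_for_lexical_type("hinoki_treebank.db", "ga-wo-ni-p-lex")
</Paragraph>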
<Paragraph position="11"> Fourthly, in order to establish the &quot;confusing&quot; lexical type links (3d), we collect from the lexicon database the homonyms of the word that a user enters as a query. To be more precise, the lexicon database returns all the words with the same orthography as the query but belonging to different lexical types. These lexical types are then linked to each other as &quot;confusing&quot; with respect to the query word. Fifthly, we construct links between our lexical types and the POSs of other lexicons, such as ChaSen, from the OtherLex DB (3e). To do this, we prepare an interface (a mapping table) between our lexical type system and the other lexicon's POS system. As this is a finite mapping it could be made manually, but we semi-automate its construction. The similarity between types in the two databases (JACY and some other lexicon) is calculated as the Dice coefficient, where W(L) denotes the set of words belonging to lexical type L:</Paragraph> <Paragraph position="12"> sim(L_A, L_B) = 2 |W(L_A) ∩ W(L_B)| / (|W(L_A)| + |W(L_B)|) </Paragraph> <Paragraph position="13"> The Dice coefficient was chosen because of its generality and ease of calculation. Any pair for which sim(L_A, L_B) is above a threshold is a candidate for mapping. The threshold must be set low, as the granularity of different systems can vary widely.</Paragraph>
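<Paragraph> As a concrete illustration of the semi-automatic mapping step, the sketch below computes Dice scores between the word sets of JACY lexical types and the POS classes of another lexicon, and keeps the pairs whose score reaches a threshold. The word lists and the threshold value are invented for illustration only.

# Sketch of the semi-automatic POS mapping; word lists and the threshold are invented.
def dice(words_a, words_b):
    """Dice coefficient over two word sets."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return 2.0 * len(a.intersection(b)) / (len(a) + len(b))

def candidate_mappings(jacy_types, other_pos, threshold=0.3):
    """Return (JACY type, other POS, score) triples whose Dice score reaches the threshold.

    jacy_types and other_pos map type/POS names to the words they contain.
    """
    pairs = []
    for t_name, t_words in jacy_types.items():
        for p_name, p_words in other_pos.items():
            score = dice(t_words, p_words)
            if score >= threshold:  # threshold kept low: granularities differ widely
                pairs.append((t_name, p_name, score))
    return sorted(pairs, key=lambda x: x[2], reverse=True)

# Invented toy data for illustration only.
jacy = {"ga-wo-ni-p-lex": ["ga", "wo", "ni"], "adv-p-lex-1": ["ni", "to"]}
chasen = {"case-particle": ["ga", "wo", "ni", "de"], "adverbial-particle": ["ni", "to", "wa"]}
print(candidate_mappings(jacy, chasen))
</Paragraph>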
<Paragraph position="14"> Linguistic discussion (3a) and implementation TODOs (3ciii) have to be entered manually. Linguistic discussion is especially difficult to collect exhaustively, since the task requires an extensive background in linguistics. We have several linguists in our group, and our achievements in this task owe much to them. We plan to make the interface open and to encourage the participation of anyone interested in the task.</Paragraph> <Paragraph position="15"> The on-line documentation is designed to complement the full grammar documentation (Siegel, 2004). The grammar documentation gives a top-down view of the grammar, presenting the overall motivation for the analyses. The lexical type documentation gives bottom-up documentation, and it can easily be updated along with the grammar.</Paragraph> <Paragraph position="16"> Writing implementation TODOs also requires expertise in grammar development and a linguistic background. But grammar developers usually take notes on what remains to be done for each lexical type anyway, so this is a relatively simple task.</Paragraph> <Paragraph position="17"> After the database is first constructed, how is it put to use and updated in the treebanking cycles described in Figure 1? Figure 4 illustrates this. Each time the grammar is revised based on treebank annotation feedback, grammar developers consult the database to see the current status of the grammar. After finishing the revision, the grammar and lexicon DBs are updated, as are the corresponding fields of the lexical type database. Each time the treebank is annotated, annotators can consult the database to make sure the chosen parse is correct. Following annotation, the treebank DB is updated, and so is the lexical type database. In parallel to this, collaborators who are

In this section, we discuss some of the ways the database can benefit people other than treebank annotators and grammar developers.</Paragraph> <Paragraph position="18"> One way is by serving as a link to other lexical resources. As mentioned in the previous section, our database includes links to ChaSen, Juman, ALT-J/E, and EDICT. Currently, in Japanese NLP (and more generally), various lexical resources have been developed, but the correspondences between them are not always clear. These lexical resources often play complementary roles, so synthesizing them seamlessly would produce a Japanese lexicon of unprecedented breadth and depth. One of our plans is to realize this by means of the lexical type database. Consider Figure 5. Assuming that most lexical resources contain lexical type information, however fine- or coarse-grained it may be, it is natural to think of the lexical type database as a &quot;hub&quot; that links those lexical resources together. This will be achieved by preparing interfaces between the lexical type database and each of the lexical resources. This is an economical way to synthesize lexical resources: the hub requires only n interfaces for n resources, whereas linking every pair directly would require n(n-1)/2 interfaces (for example, 45 interfaces for 10 resources instead of 10).</Paragraph> <Paragraph position="19"> The problem is that constructing such an interface is time consuming. We need to further test generic ways of doing this, such as using similarity scores, though we will not pursue this issue further in this paper.</Paragraph> <Paragraph position="20"> Apart from NLP, how can the database be used? In the short term, our database is intended to provide annotators and grammar developers with a clear picture of the current status of the treebank and the grammar. In the long term, we expect to create successively better approximations of the Japanese language, insofar as our deep, broad-coverage grammar describes Japanese syntax and semantics precisely. Consequently, the database would be of use to anyone who needs an accurate description of Japanese. Japanese language teachers can use its detailed descriptions of word usages, the links to other words, and the real examples from the treebank to show students subtle differences among words that look the same but are grammatically different. Lexicographers can take advantage of its comprehensiveness and its real examples to compile a dictionary that contains full linguistic explanations. Our confidence in the linguistic descriptions rests on the combination of the precise grammar and the detailed treebank it is linked to; each improves the other through the treebank annotation and grammar development cycle depicted in Figure 1.</Paragraph> </Section> </Section> </Paper>