<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1046">
  <Title>LaTaT: Language and Text Analysis Tools</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> In natural language processing, syntactic and semantic knowledge are deeply intertwined with each other, both in their acquisition and usage. The goal of our research is to build a syntactic and semantic knowledge base through an iterative process that involves both language processing and language acquisition. We start the process by parsing a large corpus with a manually constructed parser that has only syntactic knowledge. We then extract lexical semantic and statistical knowledge from the parsed corpus, such as similar words and phrases, collocations and idiomatic expressions, and selectional preferences. In the second cycle, the text corpus is parsed again with the assistance of the newly acquired semantic and statistical knowledge, which allows the parser to better resolve systematic syntactic ambiguities, removing unlikely parts of speech. Our hypothesis is that this will result in higher quality parse trees, which in turn allows extraction of higher quality semantic and statistical knowledge in the second and later cycles.</Paragraph>
    <Paragraph position="1"> LaTaT is a Language and Text Analysis Toolset that demonstrates this iterative learning process. The main components in the toolset consist of the following: * A broad coverage English parser, called Minipar. The grammar is constructed manually, based on the Minimalist Program (Chomsky 1995). Instead of using a large number of CFG rules, Minipar achieves its broad coverage by using a small set of principles to constrain the overgerating X-bar schema; * A collocation extractor that extracts frequency counts of grammatical dependency relationships from a corpus parsed with Minipar. The frequency counts are then injected into Minipar to help it rank candidate parse trees; * A thesaurus constructor (Lin, 1998) that automatically computes the word similarities based on the distributional characteristics of words in the parsed corpus. The resulting word similarity database can then be used to smooth the probability distribution in statistical language models (Dagan et al, 1997);  network where nodes represent grammatical categories and links represent types of syntactic (dependency) relationships. The grammar network consists of 35 nodes and 59 links. Additional nodes and links are created dynamically to represent subcategories of verbs.</Paragraph>
    <Paragraph position="2"> Minipar employs a message passing algorithm that essentially implements distributed chart parsing. Instead of maintaining a single chart, each node in the grammar network maintains a chart containing partially built structures belonging to the grammatical category represented by the node. The grammatical principles are implemented as constraints associated with the nodes and links. The lexicon in Minipar is derived from WordNet (Miller, 1990). With additional proper names, the lexicon contains about 130,000 entries (in base form). The lexicon entry of a word lists all possible parts of speech of the word and its subcategorization frames (if any). The lexical ambiguities are handled by the parser instead of a tagger.</Paragraph>
    <Paragraph position="3"> Minipar works with a constituency grammar internally. However, the output of Minipar is a dependency tree. A dependency relationship is an asymmetric binary relationship between a word called head, and another word called modifier (Mel'cuk, 1987). The structure of a sentence can be represented by a set of dependency relationships that form a tree. A word in the sentence may have several modifiers, but each word may modify at most one word. The root of the dependency tree does not modify any word. It is also called the head of the sentence.</Paragraph>
    <Paragraph position="4"> Figure 1 shows an example dependency tree for the sentence John found a solution to the problem. The links in the diagram represent dependency relationships. The direction of a link is from the head to the modifier in the relationship. Labels associated with the links represent types of dependency relations. Table 1 lists a subset of the dependency relations in Minipar outputs.</Paragraph>
    <Paragraph position="5"> Minipar constructs all possible parses of an input sentence.</Paragraph>
    <Paragraph position="6"> However, only the highest ranking parse tree is outputted.</Paragraph>
    <Paragraph position="7"> Although the grammar is manually constructed, the selection of the best parse tree is guided by the statistical information obtained by parsing a 1GB corpus with Minipar. The statistical ranking of parse trees is based on the following probabilistic model. The probability of a dependency tree is defined as the product of the probabilities of the dependency relationships in the tree.</Paragraph>
    <Paragraph position="8"> Formally, given a tree T with root root consisting of D dependency relationships (head</Paragraph>
    <Paragraph position="10"> ) is obtained using Maximum Likelihood Estimation. Minipar parses newspaper text at about 500 words per second on a Pentium-III 700Mhz with 500MB memory. Evaluation with the manually parsed SUSANNE corpus (Sampson, 1995) shows that about 89% of the dependency relationships in Minipar outputs are correct.</Paragraph>
    <Paragraph position="11"> 3. Collocation and Word Similarity  We define a collocation to be a dependency relationship that occurs more frequently than predicted by assuming the two words in the relationship are independent of each other. Lin (1998) presented a method to create a collocation database by parsing a large corpus. Given a word w, the database can be used to retrieve all the dependency relationships involving w and the frequency counts of the dependency relationships. Table 2 shows excerpts of the entries in the collocation database for the words duty and responsibility. For example, in the corpus from which the collocation database is constructed, fiduciary duty occurs 319 times and assume [the] responsibility occurs 390 times.</Paragraph>
    <Paragraph position="12"> The collocation database entry of a given word can be viewed as a feature vector for that word. Similarity between words can be computed using the feature vectors. Intuitively, the more features that are shared between two words, the higher the similarity between the two words will be. This intuition is captured by the Distributional Hypothesis (Harris, 1985).</Paragraph>
    <Paragraph position="13"> Features of words are of varying degree of importance. For example, while almost any noun can be used as object of include, very few nouns can be modified by fiduciary. Two words sharing the feature object-of-include is less indicative of their similarity</Paragraph>
  </Section>
class="xml-element"></Paper>