<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1204">
  <Title>Using Co-occurrence Statistics as an Information Source for Partial Parsing of Chinese</Title>
  <Section position="1" start_page="0" end_page="24" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Our partial parser for Chinese uses a learned classifier to guide a bottom-up parsing process. We describe improvements in performance obtained by expanding the information available to the classifier, from POS sequences only, to include measures of word association derived from co-occurrence statistics. We compare performance using different measures of association, and find that Yule's coefficient of colligation Y gives somewhat better results over other measures.</Paragraph>
    <Paragraph position="1"> Introduction In learning-based approaches to syntactic parsing, the earliest models developed generally ignored the individual identities of words, making decisions based only on their part-of-speech classes. On the othor hand, many later models see each word as a monolithic entity, with parameters estimated separately for each word type. In between have been models which auempt to generalize by considering similarity between words, where knowledge about similarity is deduced fi'om hand-written sources (e.g. thesauri), or induced from text. For example, The SPATTER parser (Magerman, 1995) makes use of the output of a clustering algorithm based on co-occurrence information. Because this co-occurrence information can be derived from inexpensive data with a minimum of pre-processing, it can be very inclusive and informative about even relatively rare words, thus increasing the generalization capability of the parser trained on a much smaller fully annotated corpus.</Paragraph>
    <Paragraph position="2"> The cunent work is in this spirit, making complementary use of a relatively small treebank for syntactic information and a relatively large collection of flat text for co-occurrence information. However, we do not use any kind of clustering, instead using the co-occurrence data directly. Our parser is a bottom-up parser whose actions are guided by a machine-learning-based decision-making module (we use the SNoW learner developed at the University of Illinois, Urbana..Champaign (Roth, 1998) for its strength with potentially very large feature sets and for its ease of use). The learner is able to directly use statistics derived from the co-occu~euce data to guide its decisions.</Paragraph>
    <Paragraph position="3"> We collect a variety of statistical measures of association based on bigram co-occurrence data (specifically, mutual information, t-score, X 2, likelihood ratio and Yule's coefficient of colligation Y), and make the statistics available to the decision-making module. We use labelled constituent precision and recall to compare performance of different versions of our parser on unseen test data. We observe a marked improvement in some of the versions using the co-occurrence data, with strongest performance observed in the versions using Yule's coefficient of colligation Y and mutual information, and more modest improvements in those using the other measures.</Paragraph>
    <Paragraph position="4">  The current work has developed in the context of developing a partial or &amp;quot;chunk&amp;quot; parser for Chinese, whose task is to identify certain kinds of local syntactic structure. The syntactic  analysis we use largely follows the outline of Steven Abney's work (Abney, 1994). We adopt the concept of a &amp;quot;e-head&amp;quot; and an &amp;quot;s-head&amp;quot; for each phrase, where the e-head corresponds roughly to the generally used concept of head (e.g., the main verb in a verb phrase, or the preposition in a prepositional phrase), and the s-bead is the &amp;quot;main content word&amp;quot; of a phrase (e.g., the main verb in a verb phrase, but the object of the preposition in a prepositional phrase). The core of our chunk definition is also in line with Abney's: A chunk is essentially the contiguous range of words s-headed by a given major content word. Within this basic framework, we make some aecorunaodations to the Chinese language and to practicality. For example, by our understanding of Abney's definition, a numeral-classifier phrase followed immediately by the noun it modifies should constitute two separate chunks. However such units seem likely to be useful in further processing, and easy to accurately identify, so we chose to include them in our definition of chunk.</Paragraph>
    <Paragraph position="5"> For simplicity and consistency, we adopt a very restricted phrase-structured syntactic formalism, somewhat similar to a phrase-structured formulation of a dependency grammar. In our formalism, all constituents are bina_ry branching, and the purpose of the non-terminal labels is restricted to indicating the direction of dependency between the two children. Figure 1 shows an example sentence with some indicative structures.</Paragraph>
    <Paragraph position="6"> Dependencies within individual chunks are shown with heavy arrows. A fight-pointing dependency, such as the three dependencies within the noun phrase &amp;quot;)l~.~t:~y~r~&amp;quot;, corresponds to a constituent labelled &amp;quot;right-headed&amp;quot;. A left-pointing dependency, such as that between the verb &amp;quot;~,.~\[~&amp;quot; and its aspect particle &amp;quot;T', corresponds to a constituent labelled &amp;quot;left-headed&amp;quot;. These are cases where the s-head and the e-head of the phrase are identical. When they are not identical, we have a &amp;quot;two-headed&amp;quot; dependency, like those in the phrase &amp;quot;~_L\[~'. Here, the relation between &amp;quot;~&amp;quot; and &amp;quot;.J~&amp;quot; (and between &amp;quot;~ _.L&amp;quot; and &amp;quot;~&amp;quot;) is that the left constituent provides the s-head of the phrase, while the right constituent provides the e-head.</Paragraph>
    <Paragraph position="7"> These four non-terminal categories can descn'be high- or low- level syntactic structures. However, for chunking we wish to leave the higher-level structures of a sentence unspecified, leaving only a list of local structares. We treat this in a consistent way by adding a fifth non-terminM category &amp;quot;unspecified&amp;quot;, and replacing all higher str~tures with a backbone of strictly left-branching &amp;quot;unspecified&amp;quot; nodes, anchored to a special &amp;quot;wall&amp;quot; token to the left of the sentence. This backbone structure is shown by the light lines in the figure.</Paragraph>
    <Section position="1" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
1.2 Our Data Sources -- One Large and One
Small
</SectionTitle>
      <Paragraph position="0"> During development, we made use of two corpora. The first is a relatively small-scale treebank of approximately 3500 sentences, 39,000 words, and 55,000 characters (Zhou, 1996). We transformed this corpus by annotating each phrase with c-heads and s-heads, using a large collection of hand-written rules, and then extracted chunks from this transformed version.</Paragraph>
      <Paragraph position="1"> The second corpus, which we use only as a source of co-occurrence statistics, is much larger, with approximately 67,000 sentences, 1.5 million words, and 2.2 million characters, with sentences separated and words separated and marked with parts-of-speech, but with no further syntactic annotation (Zhou and Sun, 1999). In the current work we make no use of the part-of-speech annotation, taking co-occurrence * counts of word-types alone.</Paragraph>
    </Section>
    <Section position="2" start_page="22" end_page="24" type="sub_section">
      <SectionTitle>
1.3 Our Framework -- Classifier-Guided
Shift-Reduce Parsing
</SectionTitle>
      <Paragraph position="0"> The parsing framework we use has been chosen for maximum simplicity, aided by the simplicity of the syntactic framework. In parsing, we model a left-to-right shift-reduce automaton which builds a parse-tree constituent-by-constituent in a deterministic left--to-right process. The parsing process is thus reduced to making a series of decisions of exactly what to build.</Paragraph>
      <Paragraph position="1"> For training, we extract the series of actions the shift-reduce parser would have had to make to produce the trees from the surface structure of the sentences. This gives a long series of state-action pairs: &amp;quot;when the parser was in state X, it took action Y'. The state description X is set of binary predicates describing the local surface structure of the sentence and the contents  of the stack. We describe these predicates in detail below. This series of state-action pairs is presented to the SNoW learner, which tries to learn to predict the parser actions from the parser states, attempting to find a linear diserimin:mt over these binary predicates which best accounts for the corresponding actions in the training data.</Paragraph>
      <Paragraph position="2"> These parse actions can be either &amp;quot;shift a word from the right on to the stack&amp;quot;, or &amp;quot;reduce the * , top elements of the stack&amp;quot; into a single constituent. Because our syntactic framework is strictly binary branching, each reduce action operates on exactly the top two items on the stack, so the automaton need only choose a category for the new constituent. This decision turns out to be nearly trivial, and we were able to achieve 100% accuracy on our test set using only part-of-speech information, so in the remainder of this paper we discuss only issues relating to the more difficult decision of whether to shift or reduce.</Paragraph>
      <Paragraph position="3"> Within the shift-reduce decisions, over half are pre&lt;letermined by the basic requirements of the framework. For example, if there are no words left to shift, we can only reduce. If there is only one item on the stack, we can only shift. These decisions are handled by simple deterministic rules within the parser and are not shown to the classifier either in training or in parsing. In the first version of the parser, prior to the introduction of co-occurrence statistics, the information available to the classifier is limited to parts-of-speech of words in the surface structure of the sentence, nonterminal categories of constituents already built on the stack, and parts-of-speech of the s- and e-heads of constituents already built on the stack. These are collected into schemas representing sets of poss~le binary predicates. Table 1 shows a representative subset of this original set of 18 predicate schemas (space does not allow us to present all of them). The total of all the instantiations of all these templates presents a potentially huge feature set, so we rely on an important property of the SNoW architecture, that it can handle an indefinitely large set of</Paragraph>
      <Paragraph position="5"> and t2 range over the set of part-of-speech categories, while the variables c, et, and c2 range over the set of non-terminal categories. Surface words are indexed relative to the parsing position, such that Surface-word\[O\] is the next word to be shifted.</Paragraph>
      <Paragraph position="6"> features, actually using only those features which are active. The set of these actually active features is reasonable for our set of schemas.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>